OpenAI said in a company blog post on June 30, 2026, that after debugging a year's worth of crashes in its data infrastructure, it had identified two unrelated bugs: a hardware corruption on one Azure host and a race condition that had lurked in open-source code for more than 18 years.
June 30, 2026 · OpenAI Engineering
Two Bugs, One Crash: How OpenAI Did "Epidemiology" on a Year of Core Dumps
After a year of unexplained segfaults in Rockset, the data layer behind ChatGPT search and plugins, OpenAI stopped analyzing dumps one by one and classified every crash from the past year — exposing two unrelated causes hiding inside one symptom.
18+ yrs: a race condition latent in GNU libunwind since its first x86_64 version
12+ /day: return-to-null crashes across the fleet from the race condition
~10k /s: exceptions per host; heavy backpressure exposed the tiny window
One symptom · two origins
Same segfault, completely different bugs
Bug #1 — Hardware
Where: one physical Azure host
Symptom: 8-byte %rsp offset from a CPU arithmetic error
Fix: denylist the host
Bug #2 — Software
Where: GNU libunwind (18+ yrs old)
Symptom: SIGUSR2 in a 1-instruction window sets RIP to NULL
Fix: switch to libgcc unwinder, upstream patch
Why old code suddenly broke
An 18-year-old race meets extreme scale
High QPS + heavy exceptions → SIGUSR2 fires at high frequency → Signal hits the 1-instruction window → Crash: RIP = NULL
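To see why a one-instruction window can go unnoticed for years and then crash a fleet daily, a rough back-of-the-envelope model helps. The sketch below is illustrative only: it takes the ~10k exceptions per second per host and 12 crashes per day figures from the summary above, while the host count and the assumptions that every host sustains that rate and every exception is equally exposed are hypothetical, not figures from OpenAI.

```python
SECONDS_PER_DAY = 86_400


def expected_crashes_per_day(exceptions_per_sec, hosts, hit_probability):
    """Expected daily crashes if each exception independently hits the window
    with probability hit_probability (a simplifying assumption)."""
    return exceptions_per_sec * SECONDS_PER_DAY * hosts * hit_probability


def implied_hit_probability(crashes_per_day, exceptions_per_sec, hosts):
    """Invert the model: per-exception hit probability implied by a crash rate."""
    return crashes_per_day / (exceptions_per_sec * SECONDS_PER_DAY * hosts)


if __name__ == "__main__":
    HYPOTHETICAL_HOSTS = 100  # assumption for illustration; the fleet size is not given
    p = implied_hit_probability(12, 10_000, HYPOTHETICAL_HOSTS)
    print(f"implied per-exception hit probability: {p:.2e}")
    # At a far lower exception rate, the same probability yields far fewer crashes.
    print(expected_crashes_per_day(10, HYPOTHETICAL_HOSTS, p))
```

Under these assumptions the per-exception probability is vanishingly small, which is consistent with the race staying hidden until exception volume became extreme.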
The method that worked
When a few detailed core dumps failed, looking at the whole population won.
By automatically classifying and correlating every crash from the past year, OpenAI separated hardware-caused from software-caused failures hiding inside one symptom. The company offers this "build the population dataset" approach as a reference for anyone running large-scale C++ or Azure systems.
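A minimal sketch of what population-level classification might look like follows. It is not OpenAI's tooling: the record fields (`signature`, `host`), the thresholds, and the labels are hypothetical, chosen only to illustrate how grouping crashes by signature and checking host concentration can separate a single-host cause from a fleet-wide one.

```python
from collections import Counter, defaultdict


def classify_crashes(crashes, host_share_threshold=0.5, min_crashes=3):
    """Group crash records by signature and label each group by host concentration.

    crashes: iterable of dicts with hypothetical keys "signature" and "host".
    """
    hosts_by_signature = defaultdict(Counter)
    for crash in crashes:
        hosts_by_signature[crash["signature"]][crash["host"]] += 1

    report = {}
    for signature, host_counts in hosts_by_signature.items():
        total = sum(host_counts.values())
        top_host, top_count = host_counts.most_common(1)[0]
        if total < min_crashes:
            label = "insufficient data"
        elif top_count / total >= host_share_threshold:
            # Concentration on one machine points toward that machine.
            label = "single-host (possible hardware)"
        else:
            # Spread across many machines points toward shared code.
            label = "fleet-wide (possible software)"
        report[signature] = {"total": total, "top_host": top_host, "label": label}
    return report


if __name__ == "__main__":
    sample = (
        [{"signature": "rsp_offset", "host": "host-a"}] * 4
        + [{"signature": "rsp_offset", "host": "host-b"}]
        + [{"signature": "return_to_null", "host": h}
           for h in ("host-a", "host-b", "host-c", "host-d")]
    )
    for signature, info in classify_crashes(sample).items():
        print(signature, info)
```

The point of the design is that neither label can be read off a single core dump; it only emerges once every crash sits in the same table.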