BREAKING
OpenAI Traces Mystery Crashes
Two Unrelated Bugs Found
Hardware corruption
Bug #1
●
One physical Azure host
●
8-byte %rsp offset
●
Fixed by denylisting host
Software race
Bug #2
●
GNU libunwind, 18+ years old
●
RIP set to NULL
●
Switched to libgcc unwinder
From Core Dumps to Epidemiology
1
Segfault in C++ layer
↓
2
Few dumps fail
↓
3
Population analysis
↓
4
Two causes split
0
yr
libunwind bug age
0
/s
exceptions per host
0
+
null crashes per day
Fixes Pushed Upstream
A Reference for Large C++ Systems
AI NEWS BLITZ
OpenAI says it has finally cracked a year-long string of data infrastructure crashes.