‘Twas a Cosmic Ray!

A simulation of a cosmic ray shower formed when a proton with 1 TeV (1e12 eV) of energy hits the atmosphere about 20 km above the ground. The ground shown here is an 8 km x 8 km map of Chicago's lakefront. This visualization was made by Dinoj Surendran, Mark SubbaRao, and Randy Landsberg of the COSMUS group at the University of Chicago, with the help of physicists at the Kavli Institute for Cosmological Physics and the Pierre Auger Observatory.


Have you ever had a bug that was caused by a hardware error? Something that looked like a software bug, where the code was just fine, yet the machine gave wrong results? Is it always just a blind spot preventing the programmer from seeing a subtle problem in the code, or can it be that some transistor in the CPU simply doesn't work right? Does that ever actually happen, or is it just the stuff of legends?

As programmers, we are taught that the hardware is (almost) always right. We question our code first, then maybe third-party libraries, the OS, and finally the compiler. But we practically never say things like "maybe the CPU is taking a wrong branch here" or "maybe the memory has remembered this wrong". In his programming classic "Code Complete", Steve McConnell mentioned an interesting, and controversial, data point: only 1% of reported bugs are caused by hardware errors. He argued that the data was measured in the '70s and that the number was most certainly lower "today" (the book was originally published in the '90s). But how much lower?

While I don't have any current numbers on that, I can at least provide some anecdotal evidence. What good is anecdotal evidence? Well, it certainly doesn't give accurate statistics, but it can at least give an indication of what kinds of weird things are possible. Sometimes.

This story starts with something as ordinary as a build bug. Over here at Croteam, we run about a dozen builder machines doing CI builds around the clock for various versions of several games on multiple platforms. Besides the expected, developer-caused build bugs, we've come to consider transient "glitches" normal. With several hundred builds (or at least build attempts) per day, one gets to see all the fun ways things can go wrong without anyone submitting buggy code or data: remote signing servers being offline, Windows updates rebooting the machine in the middle of a build, weird OS bugs randomly reporting "file not found" when the file is obviously there, compiler internal errors appearing at random on recompiles of a file that compiled correctly before...

Most of those are easily explainable: the server was down - check. Logs show the machine was rebooted by a Windows update - check. The OS can leak stuff sometimes - we are not going to sweat over that. Compilers are complex beasts, especially with batch compilation, as they then push multiple source files through without completely cleaning up between them.1 In all those cases (and many others), we've learned to recognize the pattern and mostly just click "retry" on such a build failure. And it almost always passes on the second try.

But this one was different. A build machine that had been happily churning out builds on an otherwise idle branch did a trouble-free clean nightly build on Saturday (just as on several days before), but then failed on Sunday. The error was our own code reporting that a data file was the wrong version. "Saved with a newer version of the application", it said. Huh. But no one had changed either the application or the file. Heck, no one had been working in that branch for weeks!2

My first reaction was: "Oh, the OS is having a tantrum again. Let's just re-run it." But it came back with the exact same error, on the same file. I became suspicious and compared the file against source control. What do you know... the file was different. One bit of the file had changed. But how could that be? The date on the file showed that it was synced months ago.
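For the curious, spotting a flip like that doesn't take anything fancier than XOR-ing the suspect file against a freshly synced copy, byte by byte. A rough sketch of such a comparison (the file names here are made up):

```cpp
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>

int main()
{
    // Read a whole file into memory.
    auto slurp = [](const char *path) {
        std::ifstream f(path, std::ios::binary);
        return std::vector<unsigned char>(std::istreambuf_iterator<char>(f),
                                          std::istreambuf_iterator<char>());
    };

    // Hypothetical names: the suspect file on the builder, and a clean copy
    // freshly re-synced from source control next to it.
    std::vector<unsigned char> local = slurp("SomeData.ep");
    std::vector<unsigned char> clean = slurp("SomeData.ep.clean");

    if (local.size() != clean.size()) {
        std::printf("sizes differ: %zu vs %zu\n", local.size(), clean.size());
        return 1;
    }
    for (std::size_t i = 0; i < local.size(); ++i) {
        unsigned char diff = local[i] ^ clean[i];
        if (diff != 0) {
            // A single set bit in the XOR mask means exactly one bit flipped.
            std::printf("byte %zu differs: %02X vs %02X (XOR mask %02X)\n",
                        i, (unsigned)local[i], (unsigned)clean[i], (unsigned)diff);
        }
    }
    return 0;
}
```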

That was the time for a brainstorm. We went through all the possible scenarios for how a file on disk that was never written to could suddenly change its contents. None of them were applicable. Not without Windows reporting a read error on the disk. Which it didn't. Just out of paranoia, we even checked the disk for S.M.A.R.T. errors (nothing there), though it didn't make sense.

Then a colleague had an idea: what if the file was not damaged on disk? What if the file was sitting in the Windows file cache, and that bit got damaged in RAM? That made sense, since the machine doesn't have ECC RAM, so a memory error would go unnoticed. The machine did have 8 GB of RAM, but it sounded like a bit of a stretch to expect the file to still be sitting in the cache even after we ran another build, since each build generates several GB of data files alone, not to mention intermediates, etc.

Nevertheless, just to be on the safe side, he rebooted the machine. Lo and behold - the file was now magically correct!
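In hindsight, there was also a way to test the cache theory without a reboot: read the file once normally (through the cache) and once with FILE_FLAG_NO_BUFFERING, which makes Windows fetch the data from the disk itself, and compare the two. Something along these lines (the path is made up, the checksum is plain FNV-1a):

```cpp
#include <windows.h>
#include <cstdint>
#include <cstdio>
#include <malloc.h>

// Hash the file contents with FNV-1a. extraFlags lets the caller pass
// FILE_FLAG_NO_BUFFERING to force the data to come from the disk
// instead of the Windows file cache.
static uint64_t HashFile(const wchar_t *path, DWORD extraFlags)
{
    HANDLE h = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL | extraFlags,
                           nullptr);
    if (h == INVALID_HANDLE_VALUE) return 0;

    // Unbuffered reads must use sector-aligned sizes and buffers;
    // a 4 KB-aligned, 64 KB buffer satisfies that on typical volumes.
    const DWORD chunkSize = 64 * 1024;
    uint8_t *buf = static_cast<uint8_t *>(_aligned_malloc(chunkSize, 4096));
    if (!buf) { CloseHandle(h); return 0; }

    uint64_t hash = 14695981039346656037ull;
    DWORD bytesRead = 0;
    while (ReadFile(h, buf, chunkSize, &bytesRead, nullptr) && bytesRead > 0) {
        for (DWORD i = 0; i < bytesRead; ++i) {
            hash = (hash ^ buf[i]) * 1099511628211ull;
        }
    }
    _aligned_free(buf);
    CloseHandle(h);
    return hash;
}

int main()
{
    // Hypothetical path of the suspect data file.
    const wchar_t *path = L"D:\\Builds\\Data\\SomeData.ep";
    uint64_t cached   = HashFile(path, 0);                       // through the file cache
    uint64_t uncached = HashFile(path, FILE_FLAG_NO_BUFFERING);  // straight from disk
    std::printf(cached == uncached
                    ? "cache and disk agree\n"
                    : "cache and disk DISAGREE - suspect RAM\n");
    return 0;
}
```

If the two checksums disagree, the copy living in the cache - that is, in RAM - is the one to blame, exactly as the reboot ended up confirming.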

This kind of error is often attributed to cosmic rays, but in normal work I always considered that more of a joke than a real possibility. Guess I'll have to reconsider that notion.

Lessons learned: Don't add too much RAM to build machines. It seems it can sometimes have detrimental effects on build stability. Also, hardware is always right - except when it isn't.



1A careful reader will notice that the last two might also be caused by "cosmic rays". But who knows. It is hard to diagnose something like that in a closed-source executable, let alone an entire closed-source OS.
2Why the heck do we do nightly builds on branches that no one is working on? Precisely for situations like this - so that if anything breaks the machine, we know it ASAP. If (when!) someone suddenly needs to urgently ship a patch on that branch, we want to be sure the build can be done, and not have to fight exotic hardware problems that appeared months before, while no one was watching.