POSTED ON FEBRUARY 23, 2021 TO DATA INFRASTRUCTURE
Mitigating the effects of silent data corruption at scale
By Harish Dattatraya Dixit
What the research is: 
Silent data corruption, or data errors that go undetected by the larger system, is a widespread problem for large-scale infrastructure systems. This type of corruption can propagate across the stack and manifest as application-level problems. It can also result in data loss and require months to debug and resolve. This work describes the best practices for detecting and remediating silent data corruptions on a scale of hundreds of thousands of machines. 
In our paper, we examine common defect types observed in CPUs, using a real-world example of silent data corruption within a data center application that led to missing rows in a database. We map out a high-level debug flow to determine the root cause and triage the missing data.
We determine that reducing silent data corruption requires not only hardware resiliency and production detection mechanisms, but also robust fault-tolerant software architectures.
How it works: 
Silent errors can occur in any of the functions a data center CPU performs. We describe one example in detail to illustrate the debug methodology and our approach to tackling such errors across our large fleet. In a large-scale infrastructure, files are usually compressed when they are not being read and decompressed when a request is made to read the file; millions of these operations are performed every day. In this example, we focus on the decompression path.
Before decompression is performed, the file size is checked to confirm that it is greater than 0; a valid compressed file with contents has a nonzero size. In our example, a file with a nonzero size was provided as input to the decompression algorithm, yet, interestingly, the file size computation returned 0. Because the computed size was 0, the file was never written into the decompressed output database.
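As a rough sketch of that control flow (the function and variable names below are hypothetical illustrations, not Facebook's actual decompression code):

```python
import zlib

def decompress_to_db(compressed_blob: bytes, db: dict, key: str) -> bool:
    """Decompress a blob and store it, mirroring the size-check-first flow.

    computed_size stands in for the file size computation described in the
    post; if that computation silently returns 0 for a nonzero file, the
    decompression step below is skipped and the file never reaches the DB.
    """
    computed_size = len(compressed_blob)  # the step that was silently corrupted
    if computed_size <= 0:
        # A valid compressed file with contents has a nonzero size, so a
        # computed size of 0 looks like an empty file and is skipped.
        return False
    db[key] = zlib.decompress(compressed_blob)
    return True
```

The check itself is sound; the failure described here is that the size computation feeding it silently returned the wrong value, so a perfectly good file took the "skip" branch.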
As a result, in some seemingly random scenarios, decompression was skipped even though the file size was nonzero, and the database that relied on the actual contents of the file had missing files. These files with blank contents and/or incorrect sizes propagated to the application. An application that keeps a list of key-value store mappings for compressed files immediately observed that some compressed files were no longer recoverable. This chain of dependencies caused the application to fail, and eventually the querying infrastructure reported data loss after decompression. The complexity is magnified because this happened only occasionally when engineers scheduled the same workload on a cluster of machines.
Detecting and reproducing this scenario in a large-scale environment is very complex. In this case, the reproducer at the multi-machine querying-infrastructure level was reduced to a single-machine workload, which showed that the failures were truly sporadic in nature. The workload was multi-threaded; upon single-threading it, the failure was no longer sporadic but consistent for a certain subset of data values on one particular core of the machine. This eliminated the sporadic behavior associated with multi-threading, but the sporadic behavior associated with the data values persisted. After a few iterations, it became obvious that the computation of
Int[(1.1)^53] = 0
as an input to the math.pow function in Scala will always produce a result of 0 on Core 59 of the CPU. However, if the computation is changed to
Int[(1.1)^52] = 142
the result is accurate.
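A minimal single-machine reproducer in this spirit (sketched in Python rather than the original Scala workload; os.sched_setaffinity is Linux-only, and the expected values follow from the examples above) might look like:

```python
import math
import os

def check_core(core_id: int) -> list:
    """Pin this process to one core and run the power computations that
    exposed the defect, returning mismatches as (exponent, got, want)."""
    os.sched_setaffinity(0, {core_id})  # Linux-only: pin to a single core
    expected = {52: 142, 53: 156}       # correct values of int((1.1)**n)
    mismatches = []
    for exponent, want in expected.items():
        got = int(math.pow(1.1, exponent))
        if got != want:
            mismatches.append((exponent, got, want))
    return mismatches

# Iterating check_core over every core of the machine isolates the defective
# one: on the faulty Core 59, int(math.pow(1.1, 53)) came back as 0.
```

On a healthy core both computations return their expected values; 1.1^53 is approximately 156.25, so its truncated integer result should be 156, not 0.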
The above diagram documents the root-cause flow. The corruption also affects calculations whose results should be nonzero. For example, the following incorrect computations were performed on the machine identified as defective. The defect affected both positive and negative powers for specific data values, and in some cases the result was nonzero when it should have been zero. Incorrect values were obtained with varying degrees of precision.
Example errors:
Int[(1.1)^3] = 0, expected = 1
Int[(1.1)^107] = 32809, expected = 26854
Int[(1.1)^-3] = 1, expected = 0
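On a healthy core, the same computations produce the expected column. A small self-check (values taken from the table above) can confirm this:

```python
import math

# (base, exponent, expected int result) from the examples above; on the
# defective core the observed results were 0, 32809, and 1 respectively.
CASES = [
    (1.1, 3, 1),
    (1.1, 107, 26854),
    (1.1, -3, 0),
]

def verify_power_cases(cases) -> list:
    """Return the cases where int(base**exp) disagrees with the expected value.

    An empty list means the core computed all cases correctly.
    """
    bad = []
    for base, exp, want in cases:
        got = int(math.pow(base, exp))
        if got != want:
            bad.append((base, exp, got, want))
    return bad
```

Running such a check per core, with the specific data values known to trigger the defect, is the essence of the targeted reproducer described below.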
For an application, this results in decompressed files that are incorrect in size and incorrectly truncated, without an end-of-file (EoF) terminator. This leads to dangling file nodes, missing data, and no traceability of the corruption within the application. The intrinsic dependency on a particular core, as well as on the data inputs, makes these problems computationally hard to detect and root-cause without a targeted reproducer. This is especially challenging when hundreds of thousands of machines are performing a few million computations every second.
After we integrated the reproducer script into our detection mechanisms, additional machines were flagged for failing the reproducer. Multiple software- and hardware-resilience mechanisms were integrated as a result of these investigations.
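At fleet scale, that integration amounts to running the reproducer across hosts and quarantining any that fail. A minimal sketch (the host names and the reproducer hook are hypothetical; in production the reproducer would be dispatched to each machine and flagged hosts drained for hardware triage):

```python
def scan_fleet(hosts, run_reproducer) -> list:
    """Run the reproducer on each host; return the hosts that failed it.

    run_reproducer is a callable(host) -> bool, where True means every
    computation in the reproducer returned its expected value.
    """
    return [host for host in hosts if not run_reproducer(host)]

# Hypothetical usage: a stub reproducer in which only "host-b" is defective.
flagged = scan_fleet(["host-a", "host-b", "host-c"],
                     lambda host: host != "host-b")
```

Because the defect is data- and core-dependent, the value of this scan comes from the reproducer itself: it must exercise the specific computations and cores known to expose the corruption.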
Why it matters: 
Silent data corruptions are becoming a more common phenomenon in data centers than previously observed. In our paper, we present an example that illustrates one of the many scenarios that can be encountered in dealing with data-dependent, elusive, and hard-to-debug errors. Multiple strategies of detection and mitigation bring additional complexity to large-scale infrastructure. A better understanding of these corruptions helps us increase the fault tolerance and resilience of our software architecture. Together, these strategies help us build the next generation of infrastructure computing to be more reliable.
Read the full paper:
Silent data corruptions at scale