Risk-driven backbone management during COVID-19 and beyond
By Chiun Lin Lim
What the research is: 
A first-of-its-kind study detailing our backbone management strategy to ensure high service performance throughout the COVID-19 pandemic. The pandemic moved most social interactions online and caused an unprecedented stress test on our global network infrastructure with tens of data center regions. At this scale, failures such as fiber cuts, router misconfigurations, and power outages are a frequent occurrence.
We ran a simulation system that identifies possible failures and quantifies their potential severity with a set of metrics that measure network risk. The risk metrics, in turn, guided operational decisions for capacity deployment. Coupled with traffic priority management and proactive capacity enhancement, our backbone resiliently withstood the COVID-19 stress test while achieving high service availability and low latency, and efficiently handled traffic surges.
How it works: 
To satisfy the network’s service-level objectives (SLO), we started by defining a set of risk metrics around demand loss, availability, and latency stretch. All these metrics are computed with respect to possible failure scenarios in the network, which can be enumerated by going through all the components making up the network. The goal of the failure modeling is to estimate the likelihood of a failure scenario as well as the duration of the failure event. Each component failure is characterized by its mean time between failures and mean time to repair. These statistics are estimated based on a combination of historical data and clustering followed by Bayesian regression modeling on common features such as vendor, ownership, and geographical region.
Our risk simulation system periodically computes the aforementioned risk metrics. It works by taking a fresh snapshot of the network topology and demand and the set of failure scenarios to consider together with their failure characteristics as its input. Due to the high number of failure scenarios, each is sharded onto a number of worker jobs that run the same code as our SD-WAN controller to compute the traffic engineering decision for the given failure scenario. The decisions are aggregated to derive the risk metrics and then logged for continuous monitoring.
During the onset of COVID-19, the risk metrics reported a significant increase in demand loss (which captures the highest traffic loss across all simulated failure scenarios), a decrease in availability and increase in latency for all quality of service (QoS) classes. The risk metrics guided us to the possible failure scenarios that, were they to occur, would degrade the network operating conditions for certain regions. Capacity was proactively deployed to mitigate these risks. Another helpful technique was looking at the traffic flows from the regions at risk, differentiating the traffic by criticality and then downgrading the QoS to a lower priority. The QoS classes are, in order of importance, infrastructure control (class 1), user traffic (class 2), internal applications (class 3), and bulk data transfer (class 4). We downgraded a lot of latency-insensitive traffic from class 3 to class 4. Less capacity is thus needed to guarantee the same level of SLO.
Why it matters: 
There is a long lead time of months to years for building up capacity for backbone networks. As such, network operators typically procure capacity based on estimated traffic growth. When COVID-19 hit, there was a significant unplanned ramp-up in traffic within a short period of time, stressing backbone infrastructure all across the world. 
Facebook was able to react swiftly thanks to its risk-driven backbone management strategy. Leveraging the risk metrics computed by our simulation systems, we quickly identified the operational pain points and prioritized capacity enhancements to bring the network back to normal. Our experience has shown that a metrics-centric approach to backbone management could adapt to rare adverse external shock. We hope our research can help operators looking to build a more resilient network. We would like to thank Ying Zhang, Guanqing Yan, Satyajeet Singh Ahuja, Alexander Nikolaidis, Soshant Bali, Bob Kamma and Gaya Nagarajan for their work on this project. 
To learn more, watch our presentation at NSDI 2021.
Read the full paper:
A social network under social distancing: Risk-driven network management during COVID-19 and beyond
How we built a general purpose key value store for Facebook with ZippyDB
Open-sourcing a more precise time appliance
Read More in ConnectivityView All
OCT 7, 2021
Inside the Lab: Expanding connectivity by sea, land, and air
SEP 28, 2021
2Africa Pearls subsea cable connects Africa, Europe, and Asia to bring affordable, high-speed internet to 3 billion people
AUG 15, 2021
Apricot subsea cable will boost internet capacity, speeds in the Asia-Pacific region
JUN 28, 2021
Driving towards an open internet ecosystem to help tackle the digital divide
MAR 28, 2021
Advancing connectivity between the Asia-Pacific region and North America
DEC 30, 2020
2020 year in review: Connectivity innovations, faster apps, and progress toward net zero
Related Posts
Related Positions
Data Center Facilities Engineering, Systems Engineer
Data Center Facilities Engineering, Systems Engineer
Project Manager, Global Data Center Connectivity
Network Planning Engineer
Network Planning Engineer
See All Jobs
Facebook © 2021
To help personalize content, tailor and measure ads, and provide a safer experience, we use cookies. By clicking or navigating the site, you agree to allow our collection of information on and off Facebook through cookies. Learn more, including about available controls: Cookies Policy
Facebook EngineeringOpen SourceFacebook Open SourceAndroidiOSWebCore DataData InfrastructureDevInfraProduction EngineeringSecurityConnectivityData Center EngineeringNetworking & TrafficVideo EngineeringVirtual RealityResearch PublicationsML ApplicationsAI ResearchResearch PublicationsWatch Videos