A routine maintenance error severs Facebook’s data centers from the Internet for over 6 hours
On October 4, 2021, Facebook users suffered a complete outage affecting all of its apps, including WhatsApp, Instagram, and Messenger, for over 6 hours. Nearly 2.9 billion users were affected, and many lost a crucial means of communication in regions where WhatsApp is the primary method for text messaging and voice calls.
It was quickly discovered that the culprit was a faulty configuration change on the backbone routers that manage traffic between Facebook’s data centers. The misconfiguration propagated across the entire network, affecting not only users but also Facebook’s own internal tools and systems, which hindered the company’s ability to diagnose and solve the problem.
Facebook later published a more detailed account explaining the causes and how a routine maintenance task resulted in a total service blackout. An incorrect command, sent to check capacity, inadvertently disabled Facebook’s Border Gateway Protocol (BGP) routers, effectively severing its data centers from the Internet. Adding to the problem, a bug in an audit tool that should have caught the mistake allowed it to be deployed live across the entire environment.
With the BGP routers offline, Facebook was no longer advertising routes to the DNS servers on its network. DNS servers are crucial Internet components that act as phonebooks, taking a domain name like facebook.com and translating it into an IP address. Facebook’s network runs its own DNS servers, which hold the IP addresses for all of its domains and answer lookups from across the Internet. When a user tried to access any of Facebook’s domains during the outage, the lookup failed because there was no destination address to direct them to.
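The lookup step described above can be illustrated with a minimal Python sketch. It is not taken from Facebook’s account; the `resolve` helper and its injectable `resolver` parameter are assumptions made for illustration, and the only real API used is the standard library’s `socket.getaddrinfo`. It shows what a client sees when a name cannot be resolved: no address comes back, so there is nowhere to send the request.

```python
import socket


def resolve(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IP addresses for hostname.

    Returns an empty list when the lookup fails, which is what clients
    effectively experienced when the DNS servers were unreachable.
    `resolver` is a hypothetical hook so the function can be exercised
    without network access.
    """
    try:
        results = resolver(hostname, None)
    except socket.gaierror:
        # The name could not be resolved: no destination address exists.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as text.
    return sorted({info[4][0] for info in results})
```

Calling `resolve("facebook.com")` on a healthy network returns one or more addresses; when the authoritative servers cannot be reached, the same call returns an empty list and the application has nothing to connect to.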
Misconfigurations Are a Top Reason for Outages
It’s long been known that human error is a top cause of network and service outages. Complex environments make mistakes both more likely and more far-reaching. Gartner predicts that through 2023, 99% of firewall breaches will be caused by misconfigurations rather than by flaws in the firewalls themselves.
In Facebook’s case, the BGP routers were the critical point of failure in an otherwise healthy network. A small change to their configuration managed to sever the company’s connections to the Internet for hours.
Security Policy Misconfigurations Can Be Worse
Although the outage was disruptive, it’s unlikely that any security event will result from it. However, that’s not always the case. Time and again, small changes have introduced unintended security vulnerabilities, exposing organizations to attack if those vulnerabilities are found and exploited.
Network security policy misconfigurations can not only lead to wide-scale outages, but are also one of the easiest ways to accidentally pave the way for devastating security breaches. The Capital One breach in 2019 was directly attributed to a misconfigured firewall that left one of the company’s cloud servers vulnerable, allowing an attacker to access sensitive data belonging to over 100 million customers.
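To make the idea of a policy misconfiguration concrete, here is a minimal, hypothetical sketch of a check for overly permissive allow rules. The `Rule` fields and the `overly_permissive` function are invented for illustration; they do not represent any vendor’s rule format, FireMon’s detection logic, or the specific flaw behind the Capital One breach.

```python
from dataclasses import dataclass

ANY = "any"


@dataclass(frozen=True)
class Rule:
    """A simplified, hypothetical firewall rule."""
    name: str
    action: str       # "allow" or "deny"
    source: str       # CIDR block or "any"
    destination: str  # address or "any"
    port: str         # port number as text, or "any"


def overly_permissive(rules):
    """Return allow rules open to any source AND any destination or port."""
    flagged = []
    for rule in rules:
        if rule.action != "allow":
            continue
        wide_source = rule.source in (ANY, "0.0.0.0/0")
        wide_target = rule.destination == ANY or rule.port == ANY
        if wide_source and wide_target:
            flagged.append(rule)
    return flagged
```

A check like this would flag a leftover "allow any to any" rule while leaving a public HTTPS rule scoped to a single host and port alone. Real policy analysis involves far more context, such as rule ordering, zones, and address objects, which this sketch deliberately ignores.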
Identify, Eliminate and Protect with FireMon
FireMon gives you visibility into high-risk policies and vulnerabilities lurking in your infrastructure and helps you catch new ones before policies are deployed. Real-time search, ongoing security assessments, and automatic policy violation detection give you the tools you need to manage network security policies across your entire environment, from on-premises data centers to the cloud.
Misconfigurations are inevitable, but FireMon minimizes the chance that they’ll compromise your network security.