
The Story Behind the Facebook Outage - October 2021


It sounds like a joke when we hear "Facebook is down", but on October 4, 2021 at 15:51 UTC, Cloudflare opened an internal incident titled "Facebook DNS lookup returning SERVFAIL" because its engineers were worried that something was wrong with their DNS resolver, 1.1.1.1. As they were about to post on their public status page, they realized something else, more serious, was going on. Social media quickly lit up, reporting what their engineers rapidly confirmed as well: Facebook and its affiliated services WhatsApp and Instagram were, in fact, all down. Their DNS names had stopped resolving and their infrastructure IPs were unreachable. It was as if someone had "pulled the cables" from their data centers all at once and disconnected them from the Internet. This was not a DNS issue in itself, but failing DNS was the first symptom of a larger Facebook outage.


What did Facebook say about this issue?


Official Statement: "Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt."

Facebook has since published a blog post giving some details of what happened internally. Externally, Cloudflare saw the BGP and DNS problems outlined in this post, but the problem actually began with a configuration change that affected the entire internal backbone. That cascaded into Facebook and its other properties disappearing from the Internet, and Facebook's own staff having difficulty getting services going again. Facebook later posted a further blog post with much more detail about what happened. You can read that post for the inside view and this post for the outside view.

BGP Issue


BGP (Border Gateway Protocol) provides a mechanism to exchange routing information between Autonomous Systems (AS) on the Internet. The big routers that make the Internet work have huge, constantly updated lists of the possible routes that can be used to deliver every network packet to its final destination. Without BGP, the Internet's routers would not know what to do, and the Internet would not work.

The individual networks each have an ASN: an Autonomous System Number. An Autonomous System (AS) is an individual network with a unified internal routing policy. An AS can originate prefixes (state that it controls a group of IP addresses) as well as transit prefixes (state that it knows how to reach specific groups of IP addresses). Cloudflare's ASN is AS13335. Every ASN needs to announce its prefix routes to the Internet using BGP; otherwise, no one will know how to connect to it or where to find it.
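As an illustration, one way to see which prefixes an AS is currently announcing is to query a public BGP data service such as the RIPEstat Data API. The sketch below (Python, not part of the original post) asks for the prefixes originated by Facebook's AS32934; the endpoint and the exact JSON layout are assumptions based on RIPEstat's public documentation, so verify them before relying on this.

import requests

# Query the RIPEstat "announced-prefixes" endpoint for Facebook's AS32934.
# Endpoint and response layout assumed from RIPEstat's public Data API docs.
url = "https://stat.ripe.net/data/announced-prefixes/data.json"
resp = requests.get(url, params={"resource": "AS32934"}, timeout=10)
resp.raise_for_status()

data = resp.json()["data"]
prefixes = [entry["prefix"] for entry in data.get("prefixes", [])]

print(f"AS32934 currently announces {len(prefixes)} prefixes, for example:")
for prefix in prefixes[:5]:
    print("  ", prefix)

During the outage, a query like this for Facebook's DNS prefixes would have come back empty, because the routes had been withdrawn.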

As an example, picture six autonomous systems on the Internet and two possible routes that a packet can take from Start to End: AS1 → AS2 → AS3, the fastest, and AS1 → AS6 → AS5 → AS4 → AS3, the slowest, which can still be used if the first route fails.
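BGP's real best-path selection involves many tie-breakers (local preference, MED, origin type, and so on), but a heavily simplified sketch that simply prefers the shortest AS path conveys the idea. The route data below is made up purely for illustration.

# Heavily simplified best-path choice: prefer the shortest AS path.
# Real BGP applies many more tie-breakers (local preference, MED, etc.).
routes_to_end = [
    ["AS1", "AS2", "AS3"],                # short path
    ["AS1", "AS6", "AS5", "AS4", "AS3"],  # longer fallback path
]

def best_path(candidates):
    # Drop withdrawn (empty) routes, then pick the shortest remaining AS path.
    available = [path for path in candidates if path]
    return min(available, key=len) if available else None

print("Chosen path:", " -> ".join(best_path(routes_to_end)))

# If AS2 withdraws its route, the first candidate disappears and the
# longer path via AS6 is selected instead.
routes_to_end[0] = []
print("After withdrawal:", " -> ".join(best_path(routes_to_end)))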

At 15:58 UTC, Cloudflare saw that Facebook had stopped announcing the routes to their DNS prefixes. That meant that, at a minimum, Facebook's DNS servers were unreachable. Because of this, Cloudflare's 1.1.1.1 DNS resolver could no longer respond to queries asking for the IP address of facebook.com.

route-views>show ip bgp 185.89.218.0/23
% Network not in table
route-views>show ip bgp 129.134.30.0/23
% Network not in table
route-views>

Meanwhile, other Facebook IP addresses remained routed, but they were not particularly useful, since without DNS Facebook and related services were effectively unreachable:

route-views>show ip bgp 129.134.30.0
BGP routing table entry for 129.134.0.0/17, version 1025798334
Paths: (24 available, best #14, table default)
  Not advertised to any peer
  Refresh Epoch 2
  3303 6453 32934
    217.192.89.50 from 217.192.89.50 (138.187.128.158)
      Origin IGP, localpref 100, valid, external
      Community: 3303:1004 3303:1006 3303:3075 6453:3000 6453:3400 6453:3402
      path 7FE1408ED9C8 RPKI State not found
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
route-views>

A BGP UPDATE message notifies a router of any changes to a prefix advertisement, or withdraws the prefix entirely. When Cloudflare checked their time-series BGP database, they saw a significant increase in the number of updates received from Facebook. Normally this chart is quite quiet: Facebook doesn't make many modifications to its network on a minute-by-minute basis. But at around 15:40 UTC, Cloudflare saw a peak of routing changes from Facebook, as shown below. That's when the trouble commenced.
Cloudflare engineers initially worried that there was some fault within their own systems and that this was why 1.1.1.1 could not resolve facebook.com.
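To reproduce that kind of spike chart, one can count UPDATE messages per minute from any BGP feed. The sketch below uses a hypothetical, made-up list of (timestamp, message type) tuples standing in for parsed BGP messages; a real analysis would pull thousands of messages from a source such as route-views MRT dumps or RIPE RIS.

from collections import Counter
from datetime import datetime

# Hypothetical sample of parsed BGP messages from AS32934 (made up for
# illustration): (UTC timestamp, message type).
messages = [
    ("2021-10-04T15:39:10", "ANNOUNCE"),
    ("2021-10-04T15:40:02", "WITHDRAW"),
    ("2021-10-04T15:40:15", "WITHDRAW"),
    ("2021-10-04T15:40:41", "WITHDRAW"),
    ("2021-10-04T15:41:05", "WITHDRAW"),
]

# Bucket updates per minute to visualize the spike around 15:40 UTC.
per_minute = Counter(
    datetime.fromisoformat(ts).strftime("%H:%M")
    for ts, msg_type in messages
)

for minute, count in sorted(per_minute.items()):
    print(minute, "#" * count, count)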

DNS Cascade


As a direct consequence, DNS resolvers all over the world stopped resolving Facebook's domain names. One of the checks Cloudflare engineers ran was the following:

➜  ~ dig @1.1.1.1 facebook.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;facebook.com. IN A

➜  ~ dig @1.1.1.1 whatsapp.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;whatsapp.com. IN A

➜  ~ dig @8.8.8.8 facebook.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;facebook.com. IN A

➜  ~ dig @8.8.8.8 whatsapp.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;whatsapp.com. IN A

This happens because DNS, like many other systems on the Internet, has its own routing mechanism. When someone types https://facebook.com into their browser, the DNS resolver, which is in charge of converting domain names into the real IP addresses to connect to, first checks its cache and uses the answer if it finds one. If not, it tries to obtain the answer from the domain's nameservers, which are typically hosted by the entity that owns the domain. If the nameservers are unreachable or fail to reply for any other reason, a SERVFAIL is returned and the browser displays an error to the user.
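The same check can be scripted. The sketch below uses the third-party dnspython library (not used in the original post) to query 1.1.1.1 directly; a SERVFAIL from the resolver surfaces here as an all-nameservers-failed exception.

import dns.exception
import dns.resolver  # third-party: pip install dnspython

# Ask Cloudflare's 1.1.1.1 resolver for facebook.com's A record.
resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["1.1.1.1"]
resolver.lifetime = 5  # overall timeout in seconds

try:
    answer = resolver.resolve("facebook.com", "A")
    print("Resolved:", [rr.address for rr in answer])
except dns.resolver.NoNameservers:
    # Raised when every queried nameserver failed to answer usefully,
    # which is how a SERVFAIL was observed during the outage.
    print("SERVFAIL: no nameserver could answer the query")
except dns.exception.Timeout:
    print("Query timed out")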

Impact on other services


When a major service goes down, people seek alternatives and want to learn more about, or discuss, what is going on. When Facebook went down, Cloudflare saw a spike in DNS requests to Twitter, Signal, and other messaging and social media services. The graph below depicts the progressive rise in activity on several social media platforms:

Another side effect of this unavailability could be seen in Cloudflare's WARP traffic to and from Facebook's affected ASN 32934. This graph shows how traffic in each country changed from 15:45 UTC to 16:45 UTC compared with three hours earlier. WARP traffic to and from Facebook's network simply vanished across the world.


Update from Facebook after reboot


At around 21:00 UTC, Cloudflare saw renewed BGP activity from Facebook's network, which peaked at 21:17 UTC.

PS: Incidents like this serve as a gentle reminder that the Internet is a very complex and interdependent system made up of millions of interconnected systems and protocols. The trust, standardization, and cooperation between entities are at the heart of making it work for almost five billion active users worldwide.

Sources and References: Cloudflare Blog, Facebook Engineering Blog