Outages may be a fact of life on the internet, but the past few days have been especially rough for some of us. Last Tuesday, Google Calendar went down for hours. Later in the week, Netflix, Hulu, and Xbox Live had problems. (Hulu on the night of The Handmaid’s Tale, no less!) And on Monday morning, users in the Northeast United States were hit by a widespread web outage that affected Verizon users and thousands of websites serviced by Cloudflare, a little-known backbone of the internet that provides security and performance services to 16 million websites. Even Downdetector, a website that tracks whether other websites are up and running, went down briefly because of the issue.
The outage was an abrupt reminder that the internet is a fragile place where one little error—in this case, by a small company in Pennsylvania—can cause swaths of the web to break with little warning. In Monday’s case, it was because the internet’s map got broken.
At about 7 a.m., the outage started affecting Verizon and then began to spread to parts of Amazon Web Services (another major piece of behind-the-scenes internet infrastructure), Reddit, podcast app Overcast, the popular chat service Discord, e-commerce provider Sonassi, live-streaming platform Twitch, and web-hosting provider WP Engine.
Many of the affected websites were serviced by Cloudflare, and so the company started getting some of the blame Monday morning. People weren’t sure if the Verizon outage and the Cloudflare outage were connected. While it’s true that about 10 percent of Cloudflare’s 16 million websites, a massive swath of the internet, were affected, there was little that Cloudflare could do about the problem, according to its chief technology officer. That’s because traffic bound for Cloudflare was never reaching the company’s network.
The internet uses something called BGP, or Border Gateway Protocol, which is basically a routing map or, as some call it, the USPS of the web. It takes internet traffic and data and picks the most efficient route to get that traffic to somewhere else on the internet (like you). That works great most of the time, but something went wrong on Monday. That something was a mistaken signal sent out by DQE Communications, a small commercial internet service provider that services about 2,000 buildings in Pittsburgh, Pennsylvania, according to Cloudflare Chief Technology Officer John Graham-Cumming. “This little company said, ‘These 2,400 networks, including some bits of Cloudflare, some bits of Amazon, some bits of Google and Facebook, whole swathes of the internet,’ they said those networks are ours, you can send us their traffic,” Graham-Cumming said. DQE confirmed that the problem originated within its network, and that it worked quickly to solve it. “We immediately examined the issue and adjusted our routing policy,” the company’s spokesperson said in a statement.
That misconfiguration was probably the result of automatic route optimizing software and not someone intentionally screwing up the routes, according to Andree Toonk at BGPMon, a company that monitors network routes and security. But the effect was the same: Once that new route was announced and the company mistakenly said it could take on all of that traffic, it spread—through what is referred to as a “route leak”—all the way up to Verizon, which apparently accepted the faulty routes and then passed them on. As a result, a massive swath of the internet’s traffic—traffic to major destinations like Facebook and Cloudflare—went off a cliff to nowhere. It was basically the Waze of the internet telling thousands of drivers to “go straight” into a ravine.
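For the curious, here is a rough sketch of why a mistaken announcement can pull in traffic so effectively. It leans on one general BGP behavior that the people quoted here don't spell out: when several routes cover the same address, routers generally forward along the most specific one. Everything in the example is invented for illustration. The network names are hypothetical, the addresses come from a range reserved for documentation, and real routers weigh many more factors than this.

```python
import ipaddress

def best_route(table, destination):
    """Return the owner of the most specific prefix covering destination.

    table is a list of (prefix, announced_by) pairs. This models only
    longest-prefix matching, not the full BGP decision process.
    """
    dest = ipaddress.ip_address(destination)
    candidates = []
    for prefix, announced_by in table:
        network = ipaddress.ip_network(prefix)
        if dest in network:
            candidates.append((network.prefixlen, announced_by))
    if not candidates:
        return None
    # The longest prefix (largest prefixlen) is the most specific match.
    return max(candidates)[1]

# Hypothetical routing table: one legitimate announcement.
table = [("203.0.113.0/24", "legitimate-host")]
before = best_route(table, "203.0.113.10")

# A hypothetical small ISP mistakenly announces two more-specific halves.
table.append(("203.0.113.0/25", "small-isp"))
table.append(("203.0.113.128/25", "small-isp"))
after = best_route(table, "203.0.113.10")

print("before the mistaken announcement:", before)
print("after the mistaken announcement:", after)
```

In this toy table the legitimate announcement never goes away. It simply loses to the more specific one, which is one reason an upstream provider that passes such routes along can spread the damage so widely.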
The problem only lasted a few hours. By 10 a.m. most of the downed services were back online. But Graham-Cumming said this is evidence of a wider problem—major internet service providers not having the safeguards in place to block and filter incorrect routes from propagating across the internet. “It begins with a small company doing something erroneous,” Graham-Cumming said. “The really big problem is that Verizon, as a large company, could have actually said, this doesn’t look right, we won’t pass it on. But they didn’t. They let it go out into the wider world, which affected a large number of folks.”
It’s pretty much a miracle that this kind of thing doesn’t happen more often. Cloudflare’s CEO blasted Verizon on Twitter over the issue. Verizon has been pretty quiet about the specifics of what happened. The company told Slate in a statement, “There was an intermittent disruption in internet service for some customers earlier this morning. Our engineers resolved the issue by 9 am ET. We are currently investigating the issue.” Cloudflare’s CEO and CTO are also criticizing the company for not responding Monday morning when they reached out to inquire about the problem.
But the real problem might be the whole BGP internet routing system, which relies basically on an honor system. “There are ways that we are trying to get away from this very trust-based system, which is to use cryptography,” Graham-Cumming said. “That way you have to prove that you are the owner of a network.” Graham-Cumming said adopting this technology, called RPKI (Resource Public Key Infrastructure), is the way to avoid similar hiccups in the future, hiccups in which a tiny company claims to own parts of Cloudflare, Amazon, and Facebook, and Verizon’s systems don’t question it at all. RPKI allows networks to better filter faulty BGP routes, and it requires that routes only be issued by networks that have the capacity and the right to announce that route. As Cloudflare pointed out in a blog post this afternoon, if Verizon had used RPKI, it would have seen that the routes issued by DQE were not valid, and they would have been dropped automatically. AT&T and several other providers have already enabled RPKI frameworks. The technology wouldn’t just stop mistakes like this one or faulty automatic software issuing mistaken routes. It would also prevent pranksters and those out to cause trouble on the internet from maliciously issuing erroneous routes.
Which would make all of us feel better. No one appreciates a Monday-morning Reddit interruption.
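For readers who want to see what that RPKI check looks like, here is a minimal sketch of route origin validation, loosely following the valid, invalid, and not-found outcomes used in the published standard for it. It assumes the cryptographic work is already done and that a router simply holds a list of verified authorizations, each saying which network number may announce which block of addresses and how specific the announcement may be. The authorization list, addresses, and network numbers below are made up (they come from ranges reserved for documentation), and this is not any provider's actual configuration.

```python
import ipaddress
from typing import NamedTuple

class Authorization(NamedTuple):
    """One verified record: who may announce a prefix, and how specifically."""
    prefix: str
    max_length: int
    asn: int

def validate(authorizations, announced_prefix, origin_asn):
    """Classify an announcement as 'valid', 'invalid', or 'not-found'."""
    route = ipaddress.ip_network(announced_prefix)
    covered = False
    for auth in authorizations:
        auth_net = ipaddress.ip_network(auth.prefix)
        # Skip records for a different IP version or an unrelated block.
        if route.version != auth_net.version or not route.subnet_of(auth_net):
            continue
        covered = True
        if auth.asn == origin_asn and route.prefixlen <= auth.max_length:
            return "valid"
    # Covered by a record but matching none of them means reject.
    return "invalid" if covered else "not-found"

# Hypothetical record: network 64500 may announce this /24 and nothing smaller.
AUTHORIZATIONS = [Authorization("203.0.113.0/24", 24, 64500)]

announcements = [
    ("203.0.113.0/24", 64500),   # the rightful owner
    ("203.0.113.0/25", 64511),   # someone else claiming a slice
    ("203.0.113.0/25", 64500),   # right owner, but more specific than allowed
    ("198.51.100.0/24", 64511),  # no record covers this block at all
]
for prefix, asn in announcements:
    print(prefix, asn, validate(AUTHORIZATIONS, prefix, asn))
```

A network that drops everything classified as invalid would discard the second and third announcements in this toy example. Blocks with no record at all come back as not-found, which is why the protection grows as more networks publish authorizations.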