On Tuesday morning, a large swath of the internet broke. The chat service Discord wasn’t working. If you went to Shopify, Dropbox, Nest, or any number of other websites, you most likely ran into “502 Gateway Error” between 9:45 a.m. and 10:15 a.m. If you tried to check if it was your own internet connection by using DownDetector, that wasn’t working either. Some speculated it was a foreign cyberattack. It wasn’t.
The culprit this time was Cloudflare—the same company connected to, but not responsible for, a similarly massive outage in late June that took down thousands of websites. Cloudflare is one of the little-known backbones of the internet, providing hosting and security infrastructure to some 16 million websites. It keeps sites secure and it speeds up loading times. But when it has an issue, so do all of its websites.
The same applies to Google Cloud, Microsoft Azure, and Amazon Web Services, which collectively control nearly 77 percent of the cloud computing market. Thousands of other websites, apps, and services rely on these three companies, plus Cloudflare, to stay up and running. Some even rely on more than one service. So whenever one of the big three cloud computing providers has an outage, so does a much wider and nearly unmeasurable swath of the internet. Therein lies the problem.
Cloudflare’s chief technology officer said the service’s outage was caused by a “bad software deploy that was rolled back.” He added that the company was “incredibly sorry that this incident occurred. Internal teams are meeting as I write performing a full post-mortem to understand how this occurred and how we prevent this from ever occurring again.” But problems connected to massive tech companies have been frequent in the past few months.
A massive Google outage on June 2 affected YouTube and G Suite apps like Google Docs and Google Calendar. The problem rippled through Google Cloud, taking down services like Shopify, Snapchat, and Discord (Discord apparently has bad luck). Apple cloud services like iCloud and iMessage were also affected by the four-hour outage. Just two weeks later, Google Calendar went down for hours. A routing issue similar to the one that affected Cloudflare, Verizon, and thousands of websites last month also caused a massive Google Cloud outage in November 2018. Amazon has had its fair share of outages, including one in March 2017 that disrupted services that rely on AWS, such as Slack.
These frequent, widespread outages illustrate the reality that a big chunk of the internet relies on a centralized framework of services. When any one of those services goes down, it will inevitably cause problems across the internet. The problems can start small, like a routing issue at a tiny internet service provider in Pennsylvania that works its way up through Verizon and across the internet, or they can start from the top, as they did Tuesday when Cloudflare deployed bad software. In the March 2017 Amazon case, someone pressed the wrong key and knocked out a huge data center in Northern Virginia.
But the point is that the internet is increasingly centralized and easily broken—a fragile place where a little error at the bottom or the top can quickly send the internet into chaos because of sites’ reliance on three or four key backbone services.