Gmail went down on Tuesday afternoon for a little more than an hour and a half. Were you terribly put out? I was, at first. I check my e-mail about four times a minute, so any outage instills a sense of panic and isolation. I got over it, though. On Twitter, which was still working, I discovered that everyone else was having the same problem, so I decided to take a nap. When I woke up a half-hour or so later, Gmail was humming again. In a blog post, Ben Treynor, Google’s reliability czar, explained that the problem was caused by a routine server upgrade that somehow went awry. Treynor added that his team was working on ways to prevent such a failure in the future and considers it “a Big Deal.”
Really? It’s nice to know that I’m entrusting my e-mail to a company that has a position called “reliability czar,” but it’s hard to see how Gmail being out for a short while constitutes a big deal. The outage was shorter than the last major Gmail failure—February 2009, two and a half hours—and the one before that, about two hours in August 2008, both of which the world survived. Moreover, was there anyone, anywhere, who couldn’t immediately work around the failure using another e-mail service, Twitter, IM, Facebook, or the phone? With so many other ways to connect to people these days, downtime on any one communications system is rarely more than a minor annoyance. And it’s one we’ll see more often, too: In an age of constant connectedness, occasional disconnectedness is sure to become routine. The Internet is complicated, and things fail. Get used to it.
Engineers use a concept called “uptime” to measure a Web service’s reliability. Uptime is usually expressed as a percentage over time; Google’s office programs aim for an uptime of 99.9 percent per month, which means that they should be down for no more than 45 minutes during a 31-day month, or nearly nine hours a year. This week’s outage obviously blew that goal for September, but over longer periods, Google says Gmail achieves 99.9 percent uptime.
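The arithmetic behind those figures is easy to check. Here is a minimal sketch in Python; the function name `allowed_downtime_minutes` is invented for illustration and is not anything Google publishes. It simply converts an uptime target into the downtime that target permits over a given period.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: float) -> float:
    """Return the minutes of downtime an uptime target permits over a period.

    Hypothetical helper for illustration only.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


# A 99.9 percent target over a 31-day month and over a 365-day year.
monthly_minutes = allowed_downtime_minutes(99.9, 31)
yearly_hours = allowed_downtime_minutes(99.9, 365) / 60

print(f"31-day month: {monthly_minutes:.1f} minutes")  # about 44.6 minutes
print(f"365-day year: {yearly_hours:.2f} hours")       # about 8.76 hours
```

By the same math, an outage of a little more than an hour and a half uses up roughly two months' worth of that 45-minute allowance.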
In a blog post last year, Matthew Glotzbach, a product manager in Google’s enterprise software division, reported that Gmail had been down, on average, about 10 to 15 minutes per month over the previous year. As he pointed out, that’s an extremely good record. E-mail systems can suffer both “planned” and “unplanned” outages. Gmail never has planned outages (engineers work on the system while it’s up), and its unplanned outage time is far shorter than that of in-house business e-mail, which typically sees about 40 to 90 minutes of downtime per month, according to one survey of corporate IT managers. In making a pitch for corporate customers, who pay Google for Gmail and other apps, Google thus argues that its Web-based programs are more reliable than local versions. (You probably can’t sue Google for the outage, but you may be entitled to some compensation if you’re a paying customer.)
Indeed, Gmail’s uptime is comparable to that of other systems whose reliability we take for granted. The power grid in the United States is online about 99.9 percent of the time, which means the average household will see no more than eight to 10 hours of downtime per year. The landline telephone network operates with similar uptime; you’ve probably picked up the phone to find a dead line once or twice, but it’s very, very rare. And I can tell you from personal experience that Gmail is at least as reliable as my home TV and Internet connections, which go down about once or twice a month, usually for a few minutes or so.
So if Gmail is as good as the power grid, the phone network, and home broadband, why does its failure spark such surprise and outrage, and always make national headlines? Part of it has to do with the Web’s pervasiveness. Electricity and phone failures are localized; they go out for different people at different times, usually as a result of natural causes. They’re also easy to explain: It makes sense that the power might go out when there are branches flying around everywhere. An online service’s outage, though, is sudden, inexplicable, and communal. Gmail goes down for everyone at the same time, none of us knows why, and because we’re all online and gabbing, the news spreads fast. Many of us also spend a lot more time on Gmail and other Web services than we do on the phone or watching TV; even if you don’t really have any pressing reason to be on e-mail or IM, the idea that someone who needs to talk to you is unable to get in touch can, in these always-on times, be cause for a major freak-out. From a technical standpoint, a Web site’s failure may be just a small glitch, but an outage on Gmail, Twitter, or Facebook often feels like a blackout, a major urban event that leaves us all a bit unmoored, flocking to other social networks for group therapy.
What’s more, many online companies—Google especially—like to hold themselves up as being nearly immune to failure. Your local electricity company rarely boasts about its engineering talent or its huge and multiply redundant data systems; we’re trained to expect an occasional power outage and be patient when it happens. But a lot of us simply expect more from Google. If Gmail were actually down nine hours a year, the tech blogosphere would call it a scandal.
That’s a mistake. Though it may sometimes suggest otherwise, Google isn’t perfect, and the occasional service hiccup isn’t such a big deal. After all, think of all the points of failure that stand between you and your e-mail: Your power could fail, your home broadband could go on the blink, your computer or router or iPhone could self-destruct—and then there are the dozens of routers along the way, and the thousands of servers and power systems and cables and all the other bleeping things at Google’s server farms. Given all this, it’s sometimes a wonder that anything works at all. Gmail was down for just an hour and a half? Somebody give them a medal.