The following article is adapted from Viktor Mayer-Schönberger and Kenneth Cukier’s Big Data: A Revolution That Will Transform How We Live, Work, and Think, out now from Houghton Mifflin Harcourt.
Mike Flowers was a lawyer in the Manhattan district attorney’s office in the early 2000s, prosecuting everything from homicides to Wall Street crimes, then made the shift to a plush corporate law firm. After a boring year behind a desk, he decided to leave that job too. Looking for something more meaningful, he thought of helping to rebuild Iraq. A friendly partner at the firm made a few calls to people in high places. The next thing Flowers knew, he was heading into the Green Zone, the secure area for American troops in the center of Baghdad, as part of the legal team for the trial of Saddam Hussein.
Most of his work turned out to be logistical, not legal. He needed to identify areas of suspected mass graves to know where to send investigators digging. He needed to ferry witnesses into the Green Zone without getting them blown up by the many IED (improvised explosive device) attacks that were a grim daily reality. He noticed that the military treated these tasks as information problems. And data came to the rescue. Intelligence analysts would combine field reports with details about the location, time, and casualties of past IED attacks to predict the safest route for that day.
On his return to New York City a few years later, Flowers realized that those methods marked a more powerful way to combat crime than he’d ever had at his disposal as a prosecutor. And he found a veritable soul mate in the city’s mayor, Michael Bloomberg, who had made his fortune in data by supplying financial information to banks. Flowers was named to a special task force assigned to crunch the numbers that might unmask the villains of the subprime mortgage scandal in 2009. The unit was so successful that a year later Mayor Bloomberg asked it to expand its scope. Flowers became the city’s first “director of analytics.” His mission: to build a team of the best data scientists he could find and harness the city’s untapped troves of information to reap efficiencies covering everything and anything.
Flowers cast his net wide to find the right people. “I had no interest in very experienced statisticians,” he says. “I was a little concerned that they would be reluctant to take this novel approach to problem solving.” Earlier, when he had interviewed traditional stats guys for the financial fraud project, they had tended to raise arcane concerns about mathematical methods. “I wasn’t even thinking about what model I was going to use. I wanted actionable insight, and that was all I cared about,” he says. In the end he picked a team of five people he calls “the kids.” All but one were economics majors just a year or two out of school and without much experience living in a big city, and they all had something a bit creative about them.
Among the first challenges the team tackled was “illegal conversions”—the practice of cutting up a dwelling into many smaller units so that it can house as many as 10 times the number of people it was designed for. They are major fire hazards, as well as cauldrons of crime, drugs, disease, and pest infestation. A tangle of extension cords may snake across the walls; hot plates sit perilously on top of bedspreads. People packed this tightly regularly die in blazes. In 2005 two firefighters died trying to rescue residents. New York City gets roughly 25,000 illegal-conversion complaints a year, but it has only 200 inspectors to handle them. There seemed to be no good way to distinguish cases that were simply nuisances from ones that were poised to burst into flames. To Flowers and his kids, though, this looked like a problem that could be solved with lots of data.
They started with a list of every property lot in the city—all 900,000 of them. Next they poured in datasets from 19 different agencies indicating, for example, if the building owner was delinquent in paying property taxes, if there had been foreclosure proceedings, and if anomalies in utilities usage or missed payments had led to any service cuts. They also fed in information about the type of building and when it was built, plus ambulance visits, crime rates, rodent complaints, and more. Then they compared all this information against five years of fire data ranked by severity and looked for correlations in order to generate a system that could predict which complaints should be investigated most urgently.
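To make the correlation step concrete, here is a minimal sketch, not the team's actual system. It assumes each lot has already been reduced to a few yes/no flags plus a record of whether it had a serious fire, and it asks how much more common fires were among flagged lots than unflagged ones. The field names (`tax_delinquent`, `foreclosure`, `had_fire`) and the eight toy records are invented for illustration.

```python
def fire_rate_lift(lots, flag):
    """Fire rate among lots with `flag` divided by the rate among lots without it.

    Returns None when either group is empty or the baseline rate is zero,
    because the ratio is undefined in those cases.
    """
    with_flag = [lot["had_fire"] for lot in lots if lot[flag]]
    without_flag = [lot["had_fire"] for lot in lots if not lot[flag]]
    if not with_flag or not without_flag:
        return None
    rate_with = sum(with_flag) / len(with_flag)
    rate_without = sum(without_flag) / len(without_flag)
    if rate_without == 0:
        return None
    return rate_with / rate_without


# Made-up toy records; a real run would cover every lot in the city.
LOTS = [
    {"tax_delinquent": True,  "foreclosure": True,  "had_fire": True},
    {"tax_delinquent": True,  "foreclosure": False, "had_fire": True},
    {"tax_delinquent": True,  "foreclosure": False, "had_fire": False},
    {"tax_delinquent": False, "foreclosure": False, "had_fire": False},
    {"tax_delinquent": False, "foreclosure": True,  "had_fire": True},
    {"tax_delinquent": False, "foreclosure": False, "had_fire": False},
    {"tax_delinquent": False, "foreclosure": False, "had_fire": False},
    {"tax_delinquent": False, "foreclosure": True,  "had_fire": False},
]

for flag in ("tax_delinquent", "foreclosure"):
    print(flag, fire_rate_lift(LOTS, flag))
```

A ratio well above 1 marks a flag worth feeding into a ranking; it says nothing about why the flag and the fires go together.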
Initially, much of the data wasn’t in usable form. For instance, the city’s record keepers did not use a single, standard way to describe location; every agency and department seemed to have its own approach. The buildings department assigns every structure a unique building number. The housing preservation department has a different numbering system. The tax department gives each property an identifier based on borough, block, and lot. The police use Cartesian coordinates. The fire department relies on a system of proximity to “call boxes” related to the location of firehouses, even though call boxes are defunct. Flowers’ kids embraced this messiness by devising a system that identifies buildings by using a small area in the front of the property based on Cartesian coordinates and then draws in geolocation data from the other agencies’ databases. Their method was inherently inexact, but the vast amount of data they were able to use more than compensated for the imperfections.
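The problem of reconciling identifiers can be pictured with a simplified sketch. It is an assumption-laden stand-in, not the coordinate-based method the team built: a hypothetical lookup table maps each agency's own identifier to one shared location key, and records that cannot be matched are set aside rather than guessed at. Every identifier, scheme name, and field below is invented.

```python
from collections import defaultdict

# Hypothetical crosswalk: (identifier scheme, agency's id) -> shared location key.
CROSSWALK = {
    ("building_number", "B-1001"): "lot-A",
    ("borough_block_lot", "1-00123-0045"): "lot-A",
    ("xy", (987654, 210987)): "lot-A",
    ("borough_block_lot", "2-00456-0007"): "lot-B",
}


def merge_by_location(records, crosswalk):
    """Group agency records under a shared location key.

    Each record is a dict with "scheme", "id", and "fields". Records whose
    identifier is not in the crosswalk are returned separately as unmatched.
    """
    merged = defaultdict(dict)
    unmatched = []
    for rec in records:
        key = crosswalk.get((rec["scheme"], rec["id"]))
        if key is None:
            unmatched.append(rec)
            continue
        merged[key].update(rec["fields"])
    return dict(merged), unmatched


# Invented sample records from three agencies plus one that cannot be placed.
RECORDS = [
    {"scheme": "building_number", "id": "B-1001", "fields": {"year_built": 1931}},
    {"scheme": "borough_block_lot", "id": "1-00123-0045", "fields": {"tax_delinquent": True}},
    {"scheme": "xy", "id": (987654, 210987), "fields": {"crime_reports": 4}},
    {"scheme": "building_number", "id": "B-9999", "fields": {"year_built": 1975}},
]

merged, unmatched = merge_by_location(RECORDS, CROSSWALK)
print(merged)
print(len(unmatched), "record(s) could not be matched")
```

Keeping the unmatched pile visible matters: an inexact join is tolerable only if you know roughly how much it is missing.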
The team members weren’t content just to crunch numbers, though. They went into the field with inspectors to watch them work. They took copious notes and quizzed the pros on everything. When one grizzled chief grunted that the building they were about to examine wouldn’t be a problem, the geeks wanted to know why he felt so sure. He couldn’t quite say, but the kids gradually determined that his intuition was based on the new brickwork on the building’s exterior, which suggested to him that the owner cared about the place.
The kids went back to their cubicles and wondered how they could possibly feed “recent brickwork” into their model as a signal. After all, bricks aren’t datafied—yet. But sure enough, a city permit is required for doing any external brickwork. Adding the permit information improved their system’s predictive performance by indicating that some suspected properties were probably not major risks.
The analytics occasionally showed that some time-honored ways of doing things were not the best, just as the scouts in Moneyball had to accept the shortcomings of their intuition. For example, the number of calls to the city’s “311” complaint hotline was considered to indicate which buildings were most in need of attention. More calls equaled more serious problems. But this turned out to be a misleading measure. A rat spotted on the posh Upper East Side might generate 30 calls within an hour, but it might take a battalion of rodents before residents in the Bronx felt moved to dial 311. Likewise, the majority of complaints about an illegal conversion might be about noise, not about hazardous conditions.
In June 2011 Flowers and his kids flipped the switch on their system. Every complaint that fell into the category of an illegal conversion was processed on a weekly basis. They gathered the ones that ranked in the top 5 percent for fire risk and passed them on to the inspectors for immediate follow-up. When the results came back, everyone was stunned.
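The weekly triage described here amounts to sorting complaints by a risk score and keeping the top slice. A minimal sketch follows, assuming each complaint already carries a numeric `risk` score; how the team actually scored complaints is not described in this excerpt, and the scores below are placeholders.

```python
import math


def top_percent(complaints, percent=5):
    """Return the highest-risk complaints, keeping the top `percent` of the list.

    The cutoff is rounded up, so any non-empty batch yields at least one complaint.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    if not complaints:
        return []
    ranked = sorted(complaints, key=lambda c: c["risk"], reverse=True)
    cutoff = math.ceil(len(ranked) * percent / 100)
    return ranked[:cutoff]


# Placeholder batch: 40 complaints with made-up, evenly spread risk scores.
week = [{"complaint_id": i, "risk": i / 40} for i in range(40)]
urgent = top_percent(week, percent=5)
print([c["complaint_id"] for c in urgent])
```

With 40 complaints in a week, a 5 percent cutoff sends two to the inspectors; the rest wait for the ordinary queue.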
Prior to the big-data analysis, inspectors followed up the complaints they deemed most dire, but only in 13 percent of cases did they find conditions severe enough to warrant a vacate order. Now they were issuing vacate orders on more than 70 percent of the buildings they inspected. By indicating which buildings most needed their attention, big data improved their efficiency fivefold. And their work became more satisfying: They were concentrating on the biggest problems. The inspectors’ newfound effectiveness had spillover benefits, too. Fires in illegal conversions are 15 times more likely than other fires to result in injury or death for firefighters, so the fire department loved it. Flowers and his kids looked like wizards with a crystal ball that let them see into the future and predict which places were most risky. They took massive quantities of data that had been lying around for years, largely unused after it was collected, and harnessed it in a novel way to extract real value. Using a big corpus of information allowed them to spot connections that weren’t detectable in smaller amounts—the essence of big data.
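The "fivefold" figure follows directly from the two hit rates quoted above. As a quick check, using only those numbers (and treating "more than 70 percent" as a floor of 70):

```python
def improvement_factor(before_rate, after_rate):
    """How many times more often an inspection now ends in a vacate order."""
    if before_rate <= 0:
        raise ValueError("before_rate must be positive")
    return after_rate / before_rate


# 13 percent of inspections before, at least 70 percent after.
factor = improvement_factor(0.13, 0.70)
print(f"at least {factor:.1f} times as many vacate orders per inspection")
```

The ratio comes out a little above five, which matches the rounded claim in the text.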
The experience of New York City’s analytical alchemists highlights many of the themes of our book. They used a gargantuan quantity of data, not just some; their list of buildings in the city represented nothing less than N=all. Much of the data, such as the location information and ambulance records, was messy, but that didn’t deter them. In fact, the benefits of using more data outweighed the drawbacks of less pristine information. They were able to achieve their accomplishments because so many features of the city had been datafied (however inconsistently), allowing them to process the information.
The inklings of experts had to take a backseat to the data-driven approach. At the same time, Flowers and his kids continually tested their system with veteran inspectors, drawing on their experience to make the system perform better. Yet the most important reason for the program’s success was that it dispensed with a reliance on causation in favor of correlation.
“I am not interested in causation except as it speaks to action,” explains Flowers. “Causation is for other people, and frankly it is very dicey when you start talking about causation. I don’t think there is any cause whatsoever between the day that someone files a foreclosure proceeding against a property and whether or not that place has a historic risk for a structural fire. I think it would be obtuse to think so. And nobody would actually come out and say that. They’d think, no, it’s the underlying factors. But I don’t want to even get into that. I need a specific data point that I have access to, and tell me its significance. If it’s significant, then we’ll act on it. If not, then we won’t. You know, we have real problems to solve. I can’t dick around, frankly, thinking about other things like causation right now.”
Excerpted from Big Data: A Revolution That Will Transform How We Live, Work, and Think by Viktor Mayer-Schönberger and Kenneth Cukier. Copyright © 2013 by Viktor Mayer-Schönberger and Kenneth Cukier. Reprinted by permission of Houghton Mifflin Harcourt Publishing Company. All rights reserved.