There really is such a thing as too much information. Say, for instance, you’re an astronomer scanning the cosmos for black holes, or a climate scientist modeling the next century of global temperature change. After just a few days recording observations or running simulations on the most sophisticated equipment, you might end up with millions of gigabytes of data. Some of it contains the stuff you’re interested in, but a whole lot of it doesn’t. It’s too much to analyze, too much even to store.
“We are drowning in data,” says Rafael Hiriart, a computer scientist at the National Radio Astronomy Observatory in New Mexico, soon to be the site of the next-generation Very Large Array radio telescope. (Its precursor, the first Very Large Array, is what Jodie Foster uses to listen for alien signals in Contact.*) When it goes online in a few years, the telescope’s antennae will collect 20 million gigabytes of night sky observations each month. Dealing with that much data will require a computer that can perform 100 quadrillion floating-point operations per second; only two supercomputers on Earth are that fast.
And it’s not just astronomers who are drowning. “I would argue just about any scientific field would be facing this,” says Bill Spotz, a program manager with the U.S. Department of Energy’s Advanced Scientific Computing Research program, which manages many of the country’s supercomputers, including Summit, the world’s second-speediest machine.
From climate modeling to genomics to nuclear physics, increasingly precise sensors and powerful computers deliver data to scientists at blistering velocities. In 2018, Summit performed the first ever exascale calculation on, of all things, a set of cottonwood tree genomes, computing in an hour what would take a regular laptop about 30 years to finish. (An exabyte is a billion gigabytes—enough to store a video call that lasts for more than 200,000 years. An exascale calculation involves a quintillion floating-point operations per second.) Supercomputers in the works, such as Frontier at Oak Ridge National Laboratory, will go even faster, and generate even more data.
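For readers who want to sanity-check the scale of these figures, here is a rough back-of-the-envelope sketch. It assumes decimal (SI) units, a 365.25-day year, and a 30-day month; the function names are illustrative and do not come from any of the researchers quoted here.

```python
# Back-of-the-envelope checks of the figures quoted in the article.
# Assumptions (not from the article): decimal SI units, 365.25-day year,
# 30-day month. All function names are hypothetical.

GIGABYTE = 10**9    # bytes
EXABYTE = 10**18    # bytes
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def gigabytes_per_exabyte() -> int:
    """How many gigabytes fit in one exabyte."""
    return EXABYTE // GIGABYTE


def implied_video_bitrate_mbps(call_years: float) -> float:
    """Bitrate (megabits per second) at which a call this long fills one exabyte."""
    bits = EXABYTE * 8
    return bits / (call_years * SECONDS_PER_YEAR) / 1e6


def speedup(laptop_years: float, supercomputer_hours: float) -> float:
    """Ratio of laptop running time to supercomputer running time."""
    laptop_hours = laptop_years * 365.25 * 24
    return laptop_hours / supercomputer_hours


def sustained_rate_gb_per_s(gigabytes_per_month: float, days_per_month: float = 30) -> float:
    """Average data rate needed to absorb a monthly volume, in gigabytes per second."""
    return gigabytes_per_month / (days_per_month * 24 * 3600)


if __name__ == "__main__":
    print("GB per EB:", gigabytes_per_exabyte())
    print("Video bitrate (Mbit/s):", round(implied_video_bitrate_mbps(200_000), 2))
    print("Summit vs. laptop speedup:", round(speedup(30, 1)))
    print("Telescope rate (GB/s):", round(sustained_rate_gb_per_s(20e6), 1))
```

Under those assumptions, the 200,000-year video call works out to a little over one megabit per second, the laptop comparison implies a speedup of roughly a quarter of a million, and 20 million gigabytes a month corresponds to several gigabytes arriving every second, around the clock.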
These humongous data volumes and incredible speeds enable scientists to make progress on all sorts of problems, from designing more efficient engines to probing the link between cancer and genetics to investigating gravity at the center of the galaxy. But the sheer amount of data can also become unwieldy—Big Data that’s too big.
This is why in January, the Department of Energy convened a (virtual) meeting of hundreds of scientists and data experts to discuss what to do about all this data, and the even larger data deluge coming down the pipeline. The DOE has since put up $13.7 million for research on ways to get rid of some of that data without getting rid of the useful stuff. In September, it awarded funds to nine of these data reduction efforts, including research teams from several national laboratories and universities. “We’re trying to wrap our arms around exabytes of data,” says Spotz.
“It’s certainly something that we need,” says Jackie Chen, a mechanical engineer at Sandia National Laboratories who uses supercomputers to simulate turbulence-chemistry interactions within internal combustion engines to develop more efficient engines burning carbon-neutral fuels. “We have the opportunity to generate data that gives us unprecedented glimpses into complex processes, but what to do with all that data? And how do you extract meaningful scientific information from that data? And how do you reduce it to a form that somebody that’s actually designing practical devices like engines can use?”
Another field that stands to benefit from better data reduction is bioinformatics. Though it’s currently less data-intensive than climate science or particle physics, faster and cheaper DNA sequencing means the tide of biological data will keep rising, says Cenk Sahinalp, a computational biologist at the National Cancer Institute. “Cost of storage is becoming an issue, and cost of analysis is a big, big issue,” he says. Data reduction could help with data-intensive -omics problems like these. For instance, data reduction could make it more feasible to sequence and analyze the genomes of thousands of individual tumor cells to target and destroy specific groups of cells.
But reducing data is especially challenging for scientific problems because it must be sensitive to the anomalies and outliers that so often are the source of insight. For instance, attempts to explain anomalous observations of a form of light emitted from hot, black objects ultimately led to quantum mechanics. Data reduction that lopped off unexpected or rare events and smoothed every curve would be unacceptable. “If you’re trying to answer a question that’s never been answered before, you may not know” which data will be useful, Spotz says. “You don’t want to throw away the interesting part.”
The researchers funded by the DOE will work on several strategies to tackle the problem, including improving compression algorithms; enabling scientific teams to have more control over which quantities are lost to compression; minimizing the dimensions represented within a data set; building data reduction into instruments themselves; and developing better ways to trigger instruments to start recording data only when some phenomenon occurs. For instance, an astronomer searching for exoplanets might want a telescope to record data only when it senses the slight dimming that occurs when a planet passes across a star. All will, to some extent, involve machine learning.
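The triggering idea can be illustrated with a toy example. The sketch below is not the method of any DOE-funded team, which would likely rely on machine learning. It is a minimal fixed-threshold trigger, and the function name, the 1 percent dip threshold, and the padding are all assumptions chosen for illustration. It also computes its baseline from the whole series at once, which a real instrument working on a live stream could not do.

```python
from statistics import median


def triggered_record(samples, dip_fraction=0.01, pad=1):
    """Keep only the samples near a dip in brightness.

    Hypothetical helper: a sample "triggers" when it falls more than
    `dip_fraction` below the median brightness; `pad` neighbouring
    samples on each side are kept for context. Returns (index, value) pairs.
    """
    baseline = median(samples)
    threshold = baseline * (1 - dip_fraction)
    keep = set()
    for i, value in enumerate(samples):
        if value < threshold:
            start = max(0, i - pad)
            stop = min(len(samples), i + pad + 1)
            keep.update(range(start, stop))
    return [(i, samples[i]) for i in sorted(keep)]


if __name__ == "__main__":
    # Made-up brightness series: steady star, brief 2 percent dimming, steady again.
    brightness = [1.0] * 10 + [0.98] * 3 + [1.0] * 10
    recorded = triggered_record(brightness)
    print(f"kept {len(recorded)} of {len(brightness)} samples")
    print([i for i, _ in recorded])
```

On this made-up series, only the three dimmed samples and one neighbor on each side are kept, and the other 18 are never stored.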
Byung-Jun Yoon, an applied mathematician at Brookhaven National Laboratory, is leading one of the data reduction teams. Over a Zoom call fittingly plagued by bandwidth issues, he explained that scientists already often reduce data out of necessity, but that “it’s more a combination of art and science.” In other words, it’s imperfect and forces scientists to be less systematic than they might like. “And that doesn’t even consider the fact that many of the data that are being generated are just dumped because they cannot be stored,” he says.
Yoon’s approach is to develop ways to quantify the impact of a data reduction algorithm on signals that scientists have precisely defined within a data set, such as a planet crossing a star or a mutation in a particular gene. Quantifying that effect will enable Yoon to tinker with the algorithm to get it to preserve an acceptable resolution within those quantities of interest, while carving away as much of the irrelevant data as possible. “We want to be more confident about data reduction,” he says. “And that’s only possible when we can quantify its impact on things that we are really interested in.”
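To make the idea of quantifying a reduction's impact concrete, here is a toy sketch. It is not Yoon's method, whose details the article does not describe. It assumes a very simple reduction (averaging blocks of consecutive samples), a very simple quantity of interest (the depth of a dip in a star's brightness), and a 5 percent error tolerance; every name in it is hypothetical.

```python
from statistics import mean, median


def block_average(samples, factor):
    """Toy data reduction: replace each run of `factor` samples with its mean."""
    return [mean(samples[i:i + factor]) for i in range(0, len(samples), factor)]


def transit_depth(samples):
    """Toy quantity of interest: how far the deepest sample sits below the median."""
    return median(samples) - min(samples)


def relative_error(original, reduced, quantity):
    """Relative change in the quantity of interest caused by the reduction."""
    reference = quantity(original)
    return abs(quantity(reduced) - reference) / abs(reference)


def most_aggressive_factor(samples, factors, quantity, tolerance):
    """Largest candidate factor whose error stays within `tolerance` (1 if none does)."""
    best = 1
    for factor in sorted(factors):
        reduced = block_average(samples, factor)
        if relative_error(samples, reduced, quantity) <= tolerance:
            best = factor
    return best


if __name__ == "__main__":
    # Made-up light curve: 40 samples with a 2 percent dip lasting 4 samples.
    flux = [1.0] * 20 + [0.98] * 4 + [1.0] * 16
    for k in (1, 2, 4, 8):
        err = relative_error(flux, block_average(flux, k), transit_depth)
        print(f"factor {k}: relative error in depth = {err:.2f}")
    print("chosen factor:", most_aggressive_factor(flux, [1, 2, 4, 8], transit_depth, 0.05))
```

On this made-up light curve, averaging in blocks of up to four samples leaves the measured depth intact, while blocks of eight blur the dip to about half its true depth. The sketch therefore settles on a fourfold reduction, the most data it can discard without distorting the one quantity it was told to protect.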
Yoon aims for his method to be applicable across scientific fields, but he will start with data sets from cryo-electron microscopy, as well as particle accelerators and light sources, which are some of the biggest data producers in science. They are expected to soon regularly generate exabytes of data, which just as soon will need reducing. If we learn nothing else from our exabytes, at least we can be sure that less is more.
Correction, Oct. 8, 2021: This article originally misspelled Jodie Foster’s first name.