Preserving Today’s Internet

It’s not just about saving static webpages—it’s about documenting the experiences.

You have a kid who goes on YouTube and looks at subway commutes, surfing his way through amateur footage of trains entering stations. Occasionally, he encounters an automatically generated recommendation that’s sketchy-looking. (You usually catch it before he can click.) His five- or 10-minute breakfast screen-time session may seem like an everyday event, but it’s also totally unique: a singular interaction of a particular human with a particular permutation of the YouTube algorithm. To Clifford Lynch, normal 2017 experiences like these are something special—something that we should think about saving.

In the online journal First Monday, Lynch, of the Coalition for Networked Information, has a fascinating speculative paper on what archives can do to document life in the age of algorithms. His call to action is a cousin to the arguments activists have been making that the algorithms that direct our lives should be transparent and subject to public scrutiny. But rather than focusing on accountability, Lynch wants records of what humans might have encountered on Facebook, Google, or Amazon during a typical interaction in 2017. He’s not just talking about the code that drives the algorithms—he means the experiences themselves, the choices offered to the kid watching train videos while having breakfast. “We have the choice of what we plan to bring forward” for future historians and curious citizens, he told me. “We can’t bring forward everything, but what we bring forward should have a sense of the present.” Without a record of algorithmic experiences, which now dominate most of our days, some of that “sense of the present” will be lost.

The challenges in documenting those experiences are myriad. First, there’s the question of feasibility. The web, with its ever-growing emphasis on personalization through algorithms, is generating a functionally infinite number of digital “objects,” presented in changing configurations. The limitations and challenges of recording those configurations become clear when you look at a project like the Internet Archive’s Wayback Machine, which captures snapshots of web pages at particular points in time. When it comes to preserving the web, the Wayback Machine is probably the best thing we’ve got going for us. But a single snapshot can record only one version of a page, and in 2017 each person may see a distinct version of the New York Times’ homepage at 8 a.m. Eastern on Dec. 7: The ads and the featured stories shift according to the reader’s browsing history, geographic location, and demographic profile.

“If you’d asked me five or seven years ago, ‘How can you preserve the digital New York Times?’ ” Lynch said, “I would have said ‘Somebody needs to get a copy of the database and keep it.’ Now, if you asked me that same question, I’d say, ‘That’s not enough, because the database doesn’t give you any sense of what viewers are actually seeing.’ ”

What needs to be documented, Lynch argues, is not only the objects an algorithmic system generates, but also the actions of the system itself, in conversation with particular users. The construction of emulators, which re-create older computer systems on new machines to help us see older digital files (and play old video games), can’t solve this problem, because the algorithmic systems we’re talking about are so huge, proprietary, and ever-changing. The goal of comprehensive preservation may be impossible: Not even the people who create the systems know everything there is to know about an algorithm, because machine learning dictates some of the system’s decisions. And even if it were possible, the infrastructure needed to do it would be very expensive. Lynch points to the trouble the Library of Congress has had with the Twitter archive, the preservation of which was a hopeful story in digital archiving back in 2010. Because of legal issues and limitations in computing power, researchers still can’t access it.

What could be done, rather than striving to save every bit of the code, would be to produce records of the results of interactions between humans and algorithms. Lynch isn’t arguing that we should try to save a copy of every such interaction. As he rightly points out, many individual human experiences went undocumented and are now lost to history—most phone calls and most personal correspondence, for example. I think of the experiences people had in places like P.T. Barnum’s American Museum, which burned down in 1865, or the old Luna Park at Coney Island, which met the same fate in the mid-1940s. Those were multimedia, multisensory spaces, where people operated under a special set of social mores; no exhibit, virtual or real-world, can ever recapture the feeling of being there.

But a visitor might have written about a trip to Barnum’s museum in 1857. Because of the 19th-century penchant for deep description, we do have some (subjective, of course) textual records of those visits. Today, by contrast, nobody is documenting our algorithmic experiences in detail. Think of a typical afternoon work-break visit to Facebook. That quotidian block of 20 minutes stolen from employers is full of tiny actions and reactions that we don’t record. Is anyone writing down lists of every choice they make during such a visit, and noting all of the ramifications of those choices: how quickly their friends commented, which ads they got served and when? Is anyone recording all those little clicks and all of the algorithm’s personalized responses in a form that future historians will be able to mine when they want to understand Facebook in 2017?

The preservation of such algorithmic experiences may require the work of a hybrid new profession, Lynch said—something like an archivist-ethnographer-journalist. In his paper, he proposes the creation of a new discipline and profession: internet documentation. Internet documentarians would have to figure out how to record these experiences in a format that a future historian could use. They would face the ethical and conceptual dilemmas familiar to the anthropologist: How do you select, and then recruit, human “witnesses” who fairly represent the kinds of populations that use these algorithmic systems? Young people’s experiences should certainly be part of this record, but how do you procure consent for them to participate while underage? How do you make sure your witnesses don’t alter their behavior when they know they’re being recorded for posterity? (Lynch considers the idea of creating fake accounts—“sock puppets,” or “robotic witnesses”—who could be constructed in ways researchers think will elicit representative or interesting responses from the systems, but thinks the technical roadblocks to this approach may be too serious.)

“The archival world … has been largely in denial” about the challenges posed by algorithms, Lynch writes in his conclusion. “The existing models and conceptual frameworks of preserving some kind of ‘canonical’ digital artifacts are increasingly inapplicable in a world of pervasive, unique, personalized, non-repeatable performances.” There may be no perfect solution to this problem, as Lynch acknowledges. But let’s start talking about it.

This article is part of Future Tense, a collaboration among Arizona State University, New America, and Slate. Future Tense explores the ways emerging technologies affect society, policy, and culture.