In June, the White House announced that the government’s security clearance program, including for individuals in civilian roles, would be consolidated under the Department of Defense.
This reorganization, largely motivated by an enormous backlog of clearance investigations, is aimed at streamlining the clearance process, and in particular the “reinvestigation” of individuals with clearances that require periodic review. At the core of these new efficiencies, the DOD claims, will be a “continuous evaluation” system that autonomously analyzes applicants’ behavior—using data such as court records, purchase histories, and credit profiles—to proactively identify security risks. The rollout is already underway: The DOD had enrolled upward of 1.2 million people in continuous evaluation as of November. But the program is far from uncontroversial, raising credible privacy concerns and drawing objections from watchdogs including the Consumer Financial Protection Bureau. As the DOD takes over millions of new civilian clearances, these worries will find a broader audience.
And, thanks to machine learning, a family of techniques that lets software learn from examples rather than being explicitly programmed, it seems that things may soon get a lot more complicated.
Defense One recently reported that the Defense Security Service—an agency within the DOD tasked with handling matters relating to personnel security—is piloting a new clearance evaluation system powered in part by machine learning. Rather than simply scanning for changes in financial or legal status that clearance holders haven’t reported, the pilot will mine a wide range of data sources to predictively flag potentially risky employees before they abuse their clearances. In addition to data collected during clearance applications, by existing continuous evaluation systems, and through broad digital activity monitoring, the pilot will aim to mine data acquired via relationships with private-sector companies and other government agencies.
The pilot’s success hinges on the premise that this massive trove of data will capture small but significant behavioral changes in its subjects—which can then be correlated with operational risk. Speaking to Defense One, Mark Nehmer, the DSS’s technical director for research and development and technology transfer, argued that the system could protect national security while helping troubled clearance holders get the help they need: “If we do our jobs right, we can help prevent suicides, data breaches, or things that people, when they’re under stress, things that they do.”
The attractiveness of an autonomous system capable of identifying security risks before they become security failures is obvious. But pinning individuals’ clearance statuses—upon which many rely for their livelihoods and to work effectively in service of national security—to automated inference-making raises a range of troubling questions. Jonathan Zittrain of the Berkman Klein Center (at which I work) is fond of dividing challenges in machine learning into two broad categories—those that arise when machine learning goes off the rails and those that arise when it works as intended. That’s a useful framing here, as the DSS pilot presents a fair few of each. If development goes awry, the system could easily embed bias and a resistance to oversight and review—all without human intention. Even if it doesn’t, its very existence might jeopardize civil liberties, modify subjects’ behavior in troubling ways, stifle institutional adaptation, and reshape the role of algorithms in the workplace far beyond the clearance process.
What if the System Goes Off the Rails?
A machine learning–enabled monitoring system of this scope and complexity will struggle to embed a vital pair of operational qualities: fairness and compatibility with human review. While existing systems face the same challenge, the specific role and context of the DSS pilot raise distinctive complications.
First, fairness. The DSS system’s focus on determining the riskiness of clearance holders will situate it within a particularly troubled subclass of predictive systems: risk assessment tools.
Risk assessment tools used in the criminal justice system, for example, have been shown (according to some definitions of fairness, at least) to be harsher toward certain demographic groups. And while there’s undeniable value in approaches to risk scoring that force us to clarify our preferences and decision criteria more explicitly than the status quo does, machine learning systems might accomplish just the opposite.
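The “definitions of fairness” caveat can be made concrete with a toy calculation. All of the numbers below are invented for illustration: a risk tool can look equally well calibrated for two groups (the same share of flagged people are genuinely risky) while still wrongly flagging innocent members of one group far more often.

```python
# Hypothetical confusion-matrix counts for two groups of clearance holders.
# tp = flagged and truly risky, fp = flagged but safe,
# fn = missed risk, tn = correctly left alone. All numbers invented.
groups = {
    "group_a": dict(tp=40, fp=20, fn=10, tn=130),
    "group_b": dict(tp=20, fp=10, fn=5, tn=15),
}

for name, g in groups.items():
    precision = g["tp"] / (g["tp"] + g["fp"])  # P(truly risky | flagged)
    fpr = g["fp"] / (g["fp"] + g["tn"])        # P(flagged | actually safe)
    print(f"{name}: precision={precision:.2f}, false positive rate={fpr:.2f}")
```

Here both groups see identical precision (two-thirds of flags are justified in each), yet a safe member of group_b is three times as likely to be flagged as a safe member of group_a. Which of those two numbers “fairness” refers to is exactly what the competing definitions disagree about.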
That’s because machine learning systems work by “training” on large datasets of past examples, extracting powerful correlations between relevant attributes. These correlations can then be used to make predictions about examples that weren’t in the training set.
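The train-then-predict pattern described above can be sketched in a few lines of pure Python. This is an illustrative toy (a nearest-centroid classifier over invented two-feature examples), not a description of the DSS system or any real risk model.

```python
# Minimal sketch of "training" on labeled examples, then predicting on
# an example that wasn't in the training set. Data and labels are invented.
def train(examples):
    """Learn one centroid (average feature vector) per label."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Assign the label whose centroid is nearest in squared distance."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(centroids, key=lambda label: dist(centroids[label]))

training_set = [([1.0, 1.0], 0), ([1.2, 0.8], 0), ([5.0, 5.2], 1), ([4.8, 5.0], 1)]
model = train(training_set)
print(predict(model, [1.1, 0.9]))  # a new example, outside the training set
```

Real systems use far richer models, but the shape is the same: correlations extracted from past examples drive predictions about new ones.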
This data-centric approach means that discrimination can easily creep into automated decision-making processes in unexpected ways—data is great at capturing existing patterns of structural bias, which then quietly embed themselves within systems during training. The DSS will need to ensure that its system—which, according to Nehmer, will be gobbling up just about any data that it can get—doesn’t fall into this trap. Given that the system will be making high-stakes decisions on the basis of such a diversity of sensitive personal information, it could exhibit harmful bias along any number of lines. It might decide, hypothetically, that people of color are more likely to present security risks, that pregnant women shouldn’t be offered top secret clearance, or that individuals with brown hair demand closer scrutiny than those with blond hair. This presents an enormous data engineering problem, necessitating checks and controls throughout the entire collection and analysis pipeline. It’s often hard to determine which data will end up giving rise to bias, and setting aside too much data could prevent the system from functioning as intended.
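A toy sketch shows how a proxy feature can smuggle historical bias back in even when the sensitive attribute is withheld from the model. Everything here is invented: the model never sees “group,” yet because the proxy feature correlates with it, the model’s flags track group membership exactly.

```python
# Invented historical records: (proxy_feature, group, past_flagged).
# The "group" column is shown only so we can audit the result afterward;
# the model itself is trained on the proxy feature and past flags alone.
historical = [
    (1, "a", 1), (1, "a", 1), (1, "a", 0),
    (0, "b", 0), (0, "b", 0), (0, "b", 1),
]

# "Train": flag a proxy value if it was flagged more often than not in the past.
flag_counts = {}
for proxy, _, flagged in historical:
    hits, total = flag_counts.get(proxy, (0, 0))
    flag_counts[proxy] = (hits + flagged, total + 1)
model = {proxy: hits / total > 0.5 for proxy, (hits, total) in flag_counts.items()}

# Audit: the model never saw "group", yet its flags reproduce the group split.
for proxy, group, _ in historical:
    print(group, "flagged" if model[proxy] else "cleared")
```

Dropping a sensitive column is not enough when other columns encode it, which is why the checks must run through the whole pipeline rather than at one gate.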
How will we know when the system messes up? That gets to the second challenge: enabling human review and oversight. Being flagged as a risk could be very costly, potentially entailing loss of livelihood, exclusion from projects or promotions, and the nuisances that come with an expanded review. As such, project overseers and, in some cases, subjects of evaluation themselves should have insight into the system’s mechanics. Ideally, subjects would also have the ability to challenge and correct inaccurate information used to classify them as risks before any action is taken on the basis of those classifications. Since applying machine learning to anything as messy as human behavior almost always entails a substantial false positive rate, detecting and addressing system errors is an absolute necessity.
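The false positive problem is easy to see with back-of-envelope arithmetic. The population figure echoes the continuous-evaluation enrollment cited above; the other rates are assumptions chosen purely for illustration, not claims about the DSS system.

```python
# Base-rate arithmetic: when genuine risks are rare, even an accurate
# detector produces mostly false alarms. All rates below are assumed.
population = 1_200_000      # roughly the continuous-evaluation enrollment
true_risk_rate = 0.001      # assume 1 in 1,000 holders is genuinely risky
sensitivity = 0.95          # assume the detector catches 95% of real risks
false_positive_rate = 0.02  # assume it wrongly flags 2% of everyone else

risky = population * true_risk_rate
true_flags = risky * sensitivity
false_flags = (population - risky) * false_positive_rate
print(f"true flags: {true_flags:.0f}, false flags: {false_flags:.0f}")
print(f"share of flags that are wrong: {false_flags / (true_flags + false_flags):.0%}")
```

Under these assumptions, roughly 95 percent of all flags land on people who did nothing wrong, which is why error detection and correction can’t be an afterthought.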
Part of the challenge of enabling human oversight and review will stem from the technical characteristics of the proposed system. Many of today’s highest-performing machine learning systems—particularly those reliant on so-called deep learning—offer little to no insight as to how their highly accurate predictions are actually made. The DSS system, which aims to identify patterns across a massive range of data streams useful in estimating risk, could likely benefit from the sorts of byzantine, unintuitive inferences that these “black box” models make so well. But its designers will have to navigate the balance between predictive power and explainability with care—if a preference for olives on pizza is, for some unknown reason, tightly correlated with poor judgment (but only in individuals who own Toyota Priuses), should the system make use of that information? And could it be part of the basis for denying or revoking a clearance?
To make matters even more difficult, the DSS will also have to balance these compelling transparency interests with considerations relating to intentional manipulation by bad actors seeking to cover their tracks. When subjects of evaluation understand the system that is being used to evaluate them, they’re much better equipped to willfully mislead it. Consider the case of machine learning–based spam filtration software: Spammers can often figure out which patterns of content and metadata will relegate their dispatches to the junk folder, then intentionally craft spam to avoid these patterns. The malicious clearance holders we would most like to identify—agents of foreign powers, for instance—may also be the most motivated and savvy, taking active measures to avoid being flagged as risky. As such, offering subjects of evaluation the opportunity to advocate for themselves within a convoluted—and potentially unfair—algorithmic process could mean jeopardizing security by compromising sources and methods. Given that the national security establishment has a tendency to strongly privilege the protection of sources and methods over transparency concerns, this will likely make reviewing and challenging the system nearly impossible.
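The spam-filter dynamic can be illustrated with a deliberately simple keyword-score filter. The tokens, weights, and threshold are invented, and real filters are statistical rather than hand-coded, but the evasion logic is the same: once the rules are known, the same pitch can be rewritten around them.

```python
# Toy spam filter: score a message by weighted keyword hits. The token
# list, weights, and threshold are invented for this illustration.
SPAM_WEIGHTS = {"free": 2, "winner": 3, "claim": 2}
THRESHOLD = 4

def is_spam(message):
    score = sum(w for token, w in SPAM_WEIGHTS.items() if token in message.lower())
    return score >= THRESHOLD

print(is_spam("Winner! Claim your free prize"))    # caught by the filter
# An adversary who knows the token list rewrites the same pitch around it.
print(is_spam("W1nner! C1aim your fr-ee prize"))   # slips through unchanged in meaning
```

Transparency about the evaluator is exactly what makes this rewrite possible, which is the tension the DSS would face between reviewability and resistance to manipulation.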
What if the System Works?
Now imagine a world in which all of the thorny problems outlined above have been robustly addressed. The system exhibits no biases along demographic lines and can clearly point to the subtle but predictively powerful patterns in subjects’ financial lives, communications, online presences, routines, and consumer behavior that enable it to forecast risk effectively (and with a manageable false positive rate). But even working exactly as intended, the system might generate deeper problems still.
First, a system capable of ingesting and processing a huge proportion of the data an individual produces—and leveraging it to flag that individual as a potential risk to national security, potentially jeopardizing multiple aspects of her life—would have a very pronounced effect on the behavior of its subjects. It could establish and autonomously enforce norms and constraints on individual behavior that go far beyond the concrete rules to which clearance holders agree. Nehmer says of subject behavior that “if it’s staying within a normative range then as long as there are no business rules that are broken, likely we don’t have a problem. If their activity dramatically increases there is likely stress. But there are a lot of ways to measure activity. If it significantly decreases, it’s likely that there is some external controlling factor on that.” That is, the system would aim to capture not only misdoings or clear precursors thereto, but also any internal emotional and cognitive states that fall outside of business as usual.
In some sense, this subtle norm-setting could be very useful, proactively discouraging individuals from even thinking about leaving the straight and narrow. But given that the system will, by Nehmer’s account, dig deep into personal data streams including what individuals communicate and publish, a lot of innocuous or useful activity could also raise flags. A clearance holder subjected to a sufficiently pervasive monitoring system would be incentivized to carefully curate her internal and external life even in endeavors completely unrelated to work. Such a pervasive breakdown of the boundary between professional and private life—reminiscent of China’s “social credit scoring” system—would compromise the privacy and agency of the very people tasked with upholding national security.
The costs of this penetrating algorithmic surveillance wouldn’t end with civil liberties. It could also undermine the U.S. government’s capacity to improve and adapt. Whistleblowers, even those who use internal processes, could be flagged as potential risks. As could others who have qualms about their work but want to push for reform through appropriate channels. It is often discontented employees who help institutions develop responsibly, but in this instance that very sentiment might deprive them of the standing to do so. The act of advocating responsibly against a broken status quo could trip just the wrong system, resulting in an invasive review process with potentially serious ramifications.
Second, systems developed for extremely sensitive, high-stakes applications at the federal level—like the review of clearances granting access to the nation’s secrets—have a history of, as Nathan Wessler and Mailyn Fidler have put it, “trickling down” to other applications at lower levels of government and in private industry. One canonical example is the stingray, a device that simulates a cell tower to track the locations and identities of nearby phones. While they were originally used primarily in terrorism investigations, stingrays are now accessible to many state and local police forces, which use them much more expansively. And—as is the case with many risk assessment systems—the oversight and review of stingray use has been stymied by concerns that transparency would make it easier for criminals to evade detection.
If the DSS pilot appears to be working, it could be the first of many. Law enforcement agencies could roll out similar systems to catch corrupt cops, with private companies doing the same to identify embezzlers, flag those who shouldn’t be trusted with trade secrets, or designate seemingly uninvested employees as poor candidates for promotion. In each of these situations, the equities at play could be different from those we see in the context of security clearances. Even if the use of a powerful and pervasive risk assessment system can be justified in the context of protecting national security—which, as should be clear by now, is far from a foregone conclusion—it may inflict similarly severe harms with far less significant benefits in the hands of a widget company.
The exact scope and specifications of the DSS pilot are still unclear. We’re missing lots of specifics as to what the system will look like, which data it will touch, and how it will integrate with existing review procedures. But one thing is certain—to roll out a prediction-based risk assessment tool of this kind would signal a dangerous new direction for personnel management at the federal level, one that could have expansive implications for civil liberties and national security far beyond its initial use case.