Summarized from its abstract, there are two aspects to the study which are particularly interesting: 1) at the data resolution used, no more than four data points are required to uniquely identify 95% of the individuals in an overall sample of 1.5 million people, and 2) reducing the data resolution degrades these results only slowly, with uniqueness falling off as roughly the 1/10 power of the resolution. This has significant implications for attempts to limit both governmental and private efforts to track private citizens' movements.
No one denies the possible benefits accruing from effective and creative use of personal data. Setting aside for the moment independent reasons one might want disclosure of personal data to be limited, correlation of disparate data sets allows advertisers to do better business, doctors to make better diagnoses, and GPS devices to give better directions through traffic. Apart from commercial interests, it also helps analysts and academics to more accurately describe and predict sociocultural phenomena, which can itself have an effect on policy and law.
However, it is just as self-evident that telling everyone everything all the time is more than most average people are willing to do. Whether for specific articulable reasons, or merely because of an inchoate creepiness factor, there is a general understanding that certain steps should be taken to protect individuals' privacy.
There are two stock methods purported to protect privacy without overly hobbling data analysis: anonymization and coarsening. We've already discussed at length the relative ineffectiveness of anonymization. This study not only demonstrates it, it also lets us put a number to it (at least in the context of cell-phone tracking): the study tells us that at a sample rate of once per hour and a geographical specificity no finer than the nearest cell phone tower, only four samples are needed to uniquely identify 95% of individuals. But it goes further, showing us how ineffective coarsening is as well: if only one in ten cell towers were used, or the sample rate were lowered to once every ten hours, only one more data point would be required to maintain the same 95% rate.
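For readers who want to see what "identifying someone from four points" means mechanically, here is a minimal sketch in Python. It is not the study's code or data: the traces are randomly generated, and the names `uniqueness`, `coarsen`, and `synthetic_traces` are hypothetical helpers invented for illustration. Each trace is assumed to be a set of (tower, hour) pairs, and a person counts as identified when the sampled points match no one else's trace.

```python
import random
from collections import defaultdict


def uniqueness(traces, p, seed=0):
    """Fraction of users pinned down by p points drawn at random from their own trace.

    traces maps a user id to a set of (tower, hour) pairs. Users with fewer
    than p points are skipped. This is an illustrative measure, not the
    study's exact estimator.
    """
    rng = random.Random(seed)
    # Inverted index: each point -> the set of users whose trace contains it.
    index = defaultdict(set)
    for user, points in traces.items():
        for point in points:
            index[point].add(user)

    unique = total = 0
    for user, points in traces.items():
        ordered = sorted(points)  # sorted so sampling is reproducible
        if len(ordered) < p:
            continue
        sample = rng.sample(ordered, p)
        # Everyone whose trace contains all p sampled points.
        candidates = set.intersection(*(index[point] for point in sample))
        unique += candidates == {user}
        total += 1
    return unique / total if total else 0.0


def coarsen(traces, tower_factor, hour_factor):
    """Lower the resolution by merging neighboring tower ids and hours into bins."""
    return {
        user: {(tower // tower_factor, hour // hour_factor) for tower, hour in points}
        for user, points in traces.items()
    }


def synthetic_traces(n_users=500, n_towers=200, n_hours=168, points_per_user=30, seed=1):
    """Purely random stand-in data; real mobility traces are far more structured."""
    rng = random.Random(seed)
    return {
        user: {
            (rng.randrange(n_towers), rng.randrange(n_hours))
            for _ in range(points_per_user)
        }
        for user in range(n_users)
    }


if __name__ == "__main__":
    traces = synthetic_traces()
    for p in (1, 2, 4):
        print(p, uniqueness(traces, p), uniqueness(coarsen(traces, 10, 10), p))
```

Running the script prints, for each number of known points, the fraction of synthetic users identified at full resolution and after a tenfold coarsening in both space and time. The numbers say nothing about real populations, since the data are random; the point is only to show how the measurement is set up.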
In Jones, the Court declined to adopt the D.C. Circuit's "mosaic" search theory, which holds that past a certain level, data collection could be a search for Fourth Amendment purposes even though no one of the data points would itself be a search. Justice Sotomayor, in her concurrence, suggested that this idea might have more validity than the Jones opinion suggested. But even if it is adopted at some point in the future, this study suggests that, with the powerful analytical tools widely available today, a handful of data points suffices to identify individuals to a high degree of specificity, and a sample that small would not trigger the theory's protections.
Forcing citizens to become Luddites in order to escape effectively perpetual surveillance is one option, but not one most of us would recommend. If neither anonymizing nor coarsening data collection can effectively protect us against being identified against our will, what tools are left to lawyers, judges, policy makers, and individual citizens to protect our anonymity?
A very nice study that shows the challenge of trying to prevent identification from data. It seems that once you know a person's top two locations (home, work), identification from public data is pretty straightforward. Because so much of the uniqueness is expressed in the most common and second most common points, they carry disproportionate information. In contrast, the example of downtown Boston on a Friday night mentioned in the article is less specific.