Where Everybody Knows Your Name

What do AOL customers, Netflix subscribers, and abortion seekers in Oklahoma have in common? Hacking their identities is a cinch.


We're all part of a huge, ongoing statistics project. Mostly, we become a part of various data sets anonymously, without even knowing it—as sales figures for Guitar Hero, traffic patterns on I-95, or levels of cocaine in an urban sewer system.

But there's another kind of data that gets released into the wild with increasing frequency: researcher bait. Netflix made its user-generated rating database publicly available as part of a prize competition designed to improve the site's movie recommendations. Three years ago, America Online released several months of search query information, just as a nice gesture to researchers. In both cases, the names and other obvious identifying information were removed before the data was set free.

Last month, Oklahoma set out to contribute a new mass of data to the world. New reporting requirements on abortion would have dumped a massive amount of information into a public database, available on the state government's website. The new laws require doctors to collect and report information about every abortion in the state, including the mother's age, marital status, race, number of children, education level, the mother's relationship to the father, the reason for the abortion, the cost, and method of payment. The form contains 37 questions, most with several subsections. The names and addresses of the women would have been omitted, though their zip codes were part of the information to be disclosed.

But as it turns out, taking your name off of something doesn't mean your fingerprints aren't all over it. Even when obvious identifying information is stripped from a large data set, personal identities can often be cracked by a geek with time on his hands.

Geeks like Arvind Narayanan and Vitaly Shmatikov, to be specific, who broke the anonymity of the Netflix set by comparing the dates of specific ratings with similar ratings on the popular Internet Movie Database, where users reveal personal information in public profiles. The vulnerability of the AOL database so horrified researchers that they have mostly left the set alone, tempting though that juicy data is. For a taste of the kind of revelations from that "anonymized" set, check out what this guy was up to:

  • 17556639 how to kill your wife
  • 17556639 how to kill your wife
  • 17556639 wife killer
  • 17556639 how to kill a wife
  • 17556639 poop
  • 17556639 dead people
  • 17556639 pictures of dead people
  • 17556639 killed people
  • 17556639 dead pictures
  • 17556639 dead pictures
  • 17556639 dead pictures
  • 17556639 murder photo
  • 17556639 steak and cheese
  • 17556639 photo of death
  • 17556639 photo of death
  • 17556639 death
  • 17556639 dead people photos
  • 17556639 photo of dead people
  • 17556639 www.murderdpeople.com
  • 17556639 decapatated photos
  • 17556639 decapatated photos
  • 17556639 car crashes3
  • 17556639 car crashes3
  • 17556639 car crash photo

Add searches for just a couple of addresses or phone numbers to that astonishingly evocative list of murder-related queries, and user 17556639 is in the bag. In 2000, then-graduate student Latanya Sweeney sliced and diced U.S. Census data and found that 87 percent of the population can be uniquely identified using only their date of birth, zip code, and gender.
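To see why three innocuous fields can single a person out, here is a minimal sketch of the kind of uniqueness count behind a figure like Sweeney's. The records are invented toy data and the function name `unique_fraction` is made up for illustration; this is not her actual method or data, just the basic idea of counting how many people share each combination of birth date, zip code, and gender.

```python
from collections import Counter

# Hypothetical toy records: (date_of_birth, zip_code, gender). Not real data.
records = [
    ("1975-03-02", "73101", "F"),
    ("1975-03-02", "73101", "F"),
    ("1982-11-19", "73101", "F"),
    ("1990-07-04", "73034", "M"),
    ("1968-01-23", "73034", "F"),
]


def unique_fraction(rows):
    """Return the share of rows whose field combination appears exactly once."""
    counts = Counter(rows)
    unique = sum(1 for row in rows if counts[row] == 1)
    return unique / len(rows)


fraction = unique_fraction(records)
# Prints: 60% of these toy records are unique on birth date, zip code, and gender
print(f"{fraction:.0%} of these toy records are unique on birth date, zip code, and gender")
```

Anyone whose combination shows up only once can be picked out by somebody who already knows those three facts about them, no name required.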

This fall, Paul Ohm of the University of Colorado Law School published a study on the "surprising failure of anonymization." He writes that we have "labored beneath a fundamental misunderstanding, which has assured us much less privacy than we have assumed. This mistake pervades nearly every information privacy law, regulation, and debate, yet regulators and legal scholars have paid it scant attention."

As Ohm notes, the tech community has become very aware of the privacy issues surrounding large data sets over the last several years. Google, for instance, has fought off broad government subpoenas demanding search queries, even though the feds weren't asking for personal information about users. Oklahoma state legislators don't seem to have gotten the memo. And it's safe to assume that federal legislators will suffer from the same problem. For now, the Oklahoma rules are on hold while a court considers a challenge to the law. The hearing was postponed this week, after a second judge recused herself from the case. But this won't be the last time courts have to consider the viability of laws like Oklahoma's. And as the federal government gets more involved with health care, the feds will be looking for ways to get more bang for their regulatory buck. One of the likely results: More disclosure mandates, so that we can all be part of the great, ongoing statistics project whether we like it or not.

There's an old(ish) adage that the Internet treats censorship as a malfunction, and routes around it. There's a corollary for online data, voiced by Sweeney, now of Harvard's Center for Research on Computation and Society, who has said that "data tend to flow around and get linked to other data." Stripping out information about names and addresses isn't enough to keep data secure. Digital data sets don't stay isolated. And as Ohm notes, that's the problem: "Data can either be useful or perfectly anonymous but never both."
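For the curious, the linking Sweeney describes can be sketched in a few lines. This is a deliberately simplified illustration, not the algorithm Narayanan and Shmatikov actually used on the Netflix data: it assumes both sources use the same star scale, and every user id, film title, rating, and function name below is hypothetical.

```python
from datetime import date

# Hypothetical "anonymized" release: user id -> {title: (stars, date rated)}
anonymized = {
    "user_A": {
        "Film One": (5, date(2005, 3, 1)),
        "Film Two": (2, date(2005, 3, 9)),
        "Film Three": (4, date(2005, 4, 2)),
    },
    "user_B": {
        "Film One": (3, date(2005, 6, 15)),
        "Film Four": (5, date(2005, 6, 20)),
    },
}

# Hypothetical public profile with a handful of ratings under a real name.
public_profile = {
    "Film One": (5, date(2005, 3, 2)),
    "Film Three": (4, date(2005, 4, 3)),
}


def match_score(anon_ratings, public_ratings, max_days=3):
    """Count films rated similarly (within one star) and within max_days of each other."""
    score = 0
    for title, (stars, when) in public_ratings.items():
        if title in anon_ratings:
            anon_stars, anon_when = anon_ratings[title]
            close_in_stars = abs(anon_stars - stars) <= 1
            close_in_time = abs((anon_when - when).days) <= max_days
            if close_in_stars and close_in_time:
                score += 1
    return score


def best_match(public_ratings, release):
    """Return the anonymized id whose ratings overlap most with the public profile."""
    return max(release, key=lambda uid: match_score(release[uid], public_ratings))


print(best_match(public_profile, anonymized))  # prints: user_A
```

Two public ratings are enough to pick user_A out of this tiny release, and everything else that user rated comes along for the ride.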

Katherine Mangu-Ward is a senior editor at Reason magazine.