The Volokh Conspiracy
Mostly law professors | Sometimes contrarian | Often libertarian | Always independent
Revised Version of "Data Scanning and the Fourth Amendment"
Now up to date.
I have posted a revised version of my draft paper, Data Scanning and the Fourth Amendment. It adds a bunch of new cases, including the various opinions from the Fourth Circuit's en banc ruling in United States v. Chatrie. It also updates the tech section. Abstract below.
A crucial question of Fourth Amendment law has recently divided courts: When government agents conduct a digital scan through a massive database, how much of a "search" occurs? The issue pops up in contexts ranging from geofence warrants and reverse keyword searches to the installation of Internet pen registers. When a government agent runs a filter through a massive database, resulting in a list of hits, is the scale of the search determined by the size of the database, the filter setting, or the filter output? Fourth Amendment law is closely attuned to the scale of a search. No search means no Fourth Amendment oversight, small searches ordinarily require warrants, and limitless searches are categorically unconstitutional. But how broad is a data scan?
This essay argues that the Fourth Amendment implications of data scans should be measured primarily by filter settings. Whether a search occurs, and how far it extends, should be based on what information is exposed to human observation. This standard demands a contextual analysis of what the output reveals about the dataset based on the filter setting. Data that passes through a filter is searched or not searched depending on whether the filter is set to expose that specific information. The proper question is what information is expressly or implicitly exposed, not what raw data passes through the filter or the raw data output. The implications of this approach are then evaluated for a range of important applications, among them geofence warrants, reverse keyword searches, tower dumps, and Internet pen registers.
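As a rough illustration of the three candidate measures the abstract distinguishes, consider a minimal sketch (the database, account names, and record contents here are all invented for illustration, not drawn from the paper):

```python
# A toy database of 6 records (in the real cases, millions).
database = [
    {"account": "alice", "text": "lunch at noon"},
    {"account": "bob",   "text": "meet me at the bridge"},
    {"account": "carol", "text": "quarterly report attached"},
    {"account": "dave",  "text": "the bridge at midnight"},
    {"account": "erin",  "text": "happy birthday"},
    {"account": "frank", "text": "soccer practice moved"},
]

# The filter setting: what the scan is programmed to expose to human eyes.
def filter_setting(record):
    return "bridge" in record["text"]

# The filter output: the list of hits returned for human review.
output = [r for r in database if filter_setting(r)]

print(len(database))  # measure 1: size of the database scanned (6 records)
print(len(output))    # measure 3: size of the output (2 hits)
```

The machine touches all six records, the filter is set to expose only records mentioning "bridge," and two hits reach a human. Which of those three numbers measures the "search" is the question the essay takes on.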
This is just a draft, and it won't be submitted to journals until August or so. As always, comments are very welcome.
Editor's Note: We invite comments and request that they be civil and on-topic. We do not moderate or assume any responsibility for comments, which are owned by the readers who post them. Comments do not represent the views of Reason.com or Reason Foundation. We reserve the right to delete any comment for any reason at any time. Comments may only be edited within 5 minutes of posting. Report abuses.
So, what if the police have an AI system intercept and transcribe into a database ALL telephone calls. Then the police search that database with very limited filters (only mentions the murder victim's name and some detail not publicly released). It seems to me there has got to be more than JUST the search filters that is at issue.
I think there are two issues: (1) when the information either (a) enters police control or (b) forces a third party to access the data, and (2) when police search that database. The searching of a massive database already in police control should be based on the search filters. For instance, if there is a massive database of all telephone calls with foreign agents collected under FISA, and you know there are some telephone calls with innocent Americans that are incidentally collected, then the extent of the search of that database should be based on the filter used (as it was clearly lawfully collected information specifically targeting people who don't have Fourth Amendment protection). But if the government requires Google to search its database of all Gmail accounts for the same kind of information, that is a much bigger deal than just the filter used. Even if the filter is reasonable, the ability of the government to seize ALL of the information, even if temporarily and not searched by the government, requires something more. If the government has a specific account, that limits the scale of data temporarily seized by the government and makes the search reasonable. It's possible a geofence warrant might be similar to a specific account identifier, as the database might be sorted by location, so Google doesn't need to search every record to find relevant records (especially if the time window was short). But a full-text search of something like all Gmail accounts would be unreasonable regardless of the narrowness of the filter.
Thanks, although I think that's a pretty different question than the one the article takes on. You're imagining a massive wiretapping violation that would clearly be government seizures of the conversations.
(https://www.yalelawjournal.org/pdf/853_76rix2f4.pdf)
This is neat! Full disclosure: I'm not a lawyer. I am an academic social scientist with a background in search technology.
I think you might want to grapple with modern document embedding approaches, or what lay people would call "A.I. powered searches". You may have noticed that more modern A.I.-powered stuff allows you to search in ways that conceptually resemble how humans store information because the search index is not actually indexing the text it's referencing in the kind of manner you discuss in this article. Rather, it has contextual knowledge about how a wide variety of terms relate to each other. So, for instance, if I search "kevin bacon flashdance" on Google, all the results I get are for Footloose, the film he actually starred in. This isn't because a series of pages explicitly say "You're thinking of Footloose, not Flashdance" or anyone programmed a rule to this effect; it's because the embedding space knows that both Footloose and Flashdance are similar along a dimension that embeds 1980s musical movies, and Kevin Bacon also shares a dimension with Footloose. This kind of technology also allows us to, for instance, pick up "death tax" when someone searches for "estate tax".
Typically, the way embedding models work is that, trained on an enormous corpus, the model learns which words tend to go together, and in turn can convert words, phrases, sentences, and whole documents into coordinates in an extremely high-dimensional space. Related ideas are situated near each other in that space; opposing ideas may be situated far apart. Then the actual entries in the database are placed in that map. Search queries are also transformed into this space, surfacing things that are "nearby".
Some things to look up if you're looking to learn -- word embedding, document embedding, RAG (retrieval-augmented generation), vector database, vector search. Large database companies have been buying up embedding providers, and every AI company advertises their ability to generate embeddings, because they are key to how generative AI / large language models work.
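To make the mechanism concrete, here is a toy sketch of vector search. The three-dimensional "embeddings" and their values are invented for illustration (real models use hundreds or thousands of learned dimensions), but the ranking step, cosine similarity against every entry, is how the basic technique works:

```python
import math

# Invented toy embeddings; one dimension loosely tracks "1980s musical
# film", another loosely tracks "Kevin Bacon".
embeddings = {
    "footloose":  [0.9, 0.8, 0.1],
    "flashdance": [0.9, 0.7, 0.0],
    "casablanca": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# A query vector standing in for the phrase "kevin bacon flashdance".
query = [0.9, 0.8, 0.05]

# Vector search: rank every database entry by similarity to the query.
ranked = sorted(embeddings,
                key=lambda k: cosine_similarity(query, embeddings[k]),
                reverse=True)
print(ranked[0])  # "footloose" surfaces first, even though the query "said" flashdance
```

Note that, as with the keyword filters discussed in the article, the scan still touches every entry; what changes is that the match is by proximity in concept space rather than by literal text.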
I doubt courts have really had to consider the ins and outs of this -- you're already responding to a system that's slow to catch up to the status quo of 10-15 years ago -- but I think at least in theory you can do so in this article. The conceptual spaces mapped by embedding models are analogous in some ways to physical space, so maybe there's an opportunity to apply your treatment of geofencing to this context? You're the legal scholar, I just thought I'd tee this up as a question.
Thanks for this! Off the top of my head, I would think the same analysis would apply—you'd just have to realize that the search for one thing is then going to search the database for something else, and grapple with that as the effective filter.
There are also interesting questions about probabilistic searches via AI, from what I understand, that I still need to grapple with. Anyway, thanks for the very interesting comment.
I'm not a lawyer, but I think this 4th Amendment approach is mistaken; a warrant for a search should be required only when the owner of the property being searched doesn't want, or may not want, the search to occur. If a third party has acquired a database and is willing to let the government search it, the breadth of the search should be irrelevant, and the targets of the search should have no 4th Amendment right to object to the search at all, since it is not their property being searched.
For comparison, suppose a crime has been committed in a neighborhood. The police are allowed to go house to house asking people who are willing to speak if they saw something; they don't need a warrant to do this, no matter how many people they wind up questioning. Should the police use this information to locate and arrest a suspect, he could not raise a defense that the police needed a warrant to question the neighbors.
In the cases, the company isn't acting on its own. Either the government is acting directly, or it has obtained a court order to require the company to act.
Let me see if we understand each other. Suppose the government goes to X and asks to search their entire database of tweets for evidence of terrorist activity. They do not present a warrant, just a request, making it clear that compliance is voluntary, with not even veiled threats of retaliation, but X agrees and makes the universe of tweets available to the government for search. As a result, the government finds and arrests a number of suspects. Do you contend that the 4th Amendment rights of those suspects were violated, despite the fact that it was not their property that was searched? If so, why? Would it make a difference in your view if the government presented the search criteria to X and X did the searches itself and passed the results to the government?
I don't understand the scenario, as tweets are public: Anyone can search the entire database of tweets by going over to Twitter.
If you want to replace it with something private, like email, then that would be a general warrant that the Fourth Amendment prohibits. The filter setting would be to collect information about terrorism from each account, and would therefore tell you information about what is in or not in each account. So yes, their rights were violated.
It's the government, the veiled threats of retaliation are baked in. Let's not be like the Court pretending that the average person feels perfectly free to walk away from a cop who is talking to them, so long as he doesn't utter the word "arrest".
Is the philosophical infraction based on human eyes, or on fishing expeditions?
Is the price of participating in modern society giving up panopticon scans of all your thousand databases your data passes through every day? I cannot believe that.
Hmm. Fishing expeditions.
Suppose you go fishing in the sea in your trawler. You trawl away in an area of sea of 100 square miles and you return to port with a hold full of fish.
Was the “place” where you searched for fish
(a) that 100 square miles of sea ?
(b) something to do with how big the holes in your net were ?
If the government has a warrant based on probable cause ordering certain information to be handed over to them, is it a "fishing expedition" for the possessor of the data to hand over that data—while not handing over anything else?
"and particularly describing the place to be searched, and the persons or things to be seized."
I suppose the filter settings are analogous to "the persons or things to be seized", but wouldn't that leave "the place to be searched" the entire database?
Otherwise you're turning it into "and particularly describing the persons or things to be seized, and the persons or things to be seized." The place and the thing looked for are different requirements of the 4th Amendment, not the same requirement.
You have to put the whole thing into play, not just one part.
"But how broad is a data scan?"
Easy...the scan can only be as big as the database.
If you scan X, you're not going to get Bank of America account data.
Even the Internet is a finite database (ever expanding/changing but still finite).
Which leads to "limitless searches are categorically unconstitutional . . . . "
How can a search be limitless if every database is finite?
(Unless you mean 'limitless' as in a time limit which then gets into the surveillance world which is similar but also different than searches.)
Automated data searches aren't like physical searches in this one respect: for a machine to search for a thing, it must necessarily search all the things. This is because a machine has no ability to "know" in advance whether any data should or should not be included in the search. Everything is symbolic, and all the symbols are indistinguishable in advance from the machine's point of view.
Therefore it seems to this layman that ruling out "universally" scoped data searches is a non sequitur. There IS no other kind of data search.
You are quite right to focus on what information is RETURNED as a result of the search, not the scale of the search. Specifically, what information might leak or be improperly revealed in the results. That is of special concern given that all searches are universal in scope.
So in addition to the search filters (search "queries" is the term we'd use), what needs to be authorized by the court is the specific extent of allowable returned results. For example, in a search of photographic data, the court might limit results to photos of person X and not persons Y.
You are quite right to focus on what information is RETURNED as a result of the search, not the scale of the search.
As Brett explains above, the warrant has to specify :
"the place to be searched" and
"the persons or things to be seized."
Prof Kerr's approach does fine with the latter, but skips right past the former.
The place to be searched is - ineluctably - that place or thing within the bounds of which the search is to be conducted. It is not what tool you use to conduct the search within that place or thing, nor is it what you are looking for, nor what you manage to find.
When you search with your hand within your washing machine for any possible stray socks that have got stuck to the sides, it is fairly unlikely that your searching instrument (your hand) will also find the SIM card that you accidentally left in your shirt pocket. You're not using your hand to sweep closely and carefully for SIM cards. Just for damp socks.
None of which affects the obvious point that the place that you are searching is the interior of your washing machine. That you find some socks does not mean that you were searching only that part of the interior of your washing machine in which the socks themselves were contained.
Given the nature of automated pattern matching, I suppose we could get some additional protections by having the courts specify TWO queries: first the query to select the universe of data that will be searched, and second the query to select the subset that will be returned for human reading.
And to be sufficient, these queries need to be stated in both positive and negative terms: you CAN include X but you CAN'T include Y. In software testing, we called the former the "happy path", and the latter the "exceptions". You can't have a sufficient test without both, like two sides of a coin.
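The commenter's two-query proposal can be sketched as follows. Everything here is hypothetical (the records, fields, and persons "X" and "Y" are invented): the first query selects the universe the machine may scan, the second selects what may be surfaced for human reading, stated in both positive ("can include") and negative ("can't include") terms:

```python
# Invented records standing in for, e.g., photographic data with metadata.
records = [
    {"person": "X", "location": "geofence",  "photo": "x1.jpg"},
    {"person": "Y", "location": "geofence",  "photo": "y1.jpg"},
    {"person": "X", "location": "elsewhere", "photo": "x2.jpg"},
]

# Query 1: the universe of data the machine may examine
# (roughly, "the place to be searched").
scan_universe = [r for r in records if r["location"] == "geofence"]

# Query 2: what may be returned for human reading (roughly, "the things
# to be seized"), with a positive term (include person X) and a negative
# term (exclude person Y) -- the "happy path" and the "exceptions".
returned = [r for r in scan_universe
            if r["person"] == "X" and r["person"] != "Y"]

print(len(scan_universe))  # 2 records pass the first query
print(len(returned))       # 1 record reaches human review
```

The negative term is redundant here as written, but in a real authorization the include and exclude lists would not perfectly mirror each other, which is the commenter's point about needing both sides of the coin.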
So, to the professor's point, while it is helpful to talk about filters, I think the courts also need to understand the epistemological challenge here, that symbols are not like physical things and you can't just apply the rules of one to the other.
The NSA's domestic surveillance was explained to me thus: Government computers collect everything that flows through the Internet wherever the taps are located, terabytes upon terabytes. For each bit of data a little block of code decides whether to pass the data on or discard it. The output could conceivably be limited to legally interceptable data. In practice this won't be litigated due to the state secrets privilege or the ability to launder information of dubious provenance. If it were litigated, the argument in the paper says it is OK to collect basically everything as long as the little block of filter code works to a judge's satisfaction.
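The filtering arrangement described above, a predicate applied to everything in the stream, passing some items on and discarding the rest, can be sketched like this (the stream contents and the "origin" tag are invented for illustration):

```python
# The "little block of code": a predicate applied to every item.
def filter_block(packet):
    # Pass on only traffic tagged as foreign; discard the rest.
    return packet["origin"] == "foreign"

# An invented stand-in for the intercepted stream.
stream = [
    {"origin": "foreign",  "payload": "a"},
    {"origin": "domestic", "payload": "b"},
    {"origin": "foreign",  "payload": "c"},
]

retained = [p for p in stream if filter_block(p)]
discarded = [p for p in stream if not filter_block(p)]

print(len(retained), len(discarded))  # 2 retained, 1 discarded
```

Note that every packet, including the domestic one, is examined by the filter block before being discarded, which is exactly the collect-everything-then-filter structure the comment says the paper's argument would bless so long as a judge is satisfied with the filter code.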
Now what if the filter is misused against others but not the litigant? Suppose this could be proved. I've read about CIA agents snooping on their exes, a sheriff tracking a judge's cell phone, and many police officers checking out hot women or hostile reporters. Does a pattern of misuse invalidate an otherwise legitimate use of a search tool?
I personally liked Judge Berner's approach in the Fourth Circuit geofencing case that Professor Kerr summarized as "No search occurs when the government gets geofencing information in anonymized form. So the initial stage of geofence warrants is not a search. However, when the anonymized data is later linked to a particular individual, a search occurs and a warrant is needed."
That seems right to me because there's all sorts of data sets that the government might reasonably want to analyze for reasons that are obviously not searches (e.g., how many people are buying eggs in Chicago lately?) but that feel very much like a search when the goal of the analysis is to find some specific individual(s) ("show me the list of the people who bought eggs at this particular Jewel-Osco on May 5 between 10 and 11 AM").
I think this approach is also roughly compatible with Professor Kerr's suggestion that filter settings are the key to this sort of analysis, with the overlay that the particular filter setting that matters is whether the filters are producing personally identifiable information as a result or not. So: 1) filters that aren't intended to identify any particular person aren't a search; 2) filters that identify a small number of particular people are a search that requires a warrant; and 3) filters that by design identify a large number of people are impermissibly broad and not allowed under any circumstances.
I've never liked that approach, because it seems to me that a warrant should be required as soon as the entity that has whatever is to be searched ceases to be given a choice in the matter, whether on account of being subject to an explicit order, or just an implied threat of retaliation if permission isn't forthcoming.
Isn't the original purpose of a 'warrant' to demonstrate to the subject of the search that the agent of the government conducting it actually DOES have the authority to demand access? And without it the government agent had no more right to conduct a search than any random person?
I agree that if the government is going to compel whoever has the data to trawl through the data and/or turn it over, that should also require a warrant.
But there's lots of data sets available for free or for purchase, or by the government asking nicely (and I don't think that every government request comes with implied coercion). In my opinion, the government still shouldn't be able to use those data sets to identify individuals without a warrant. There's some interesting discussion of the topic in the context of the search for the Golden State Killer in this (student) article: https://lawcat.berkeley.edu/record/1136717?v=pdf
tl;dr: Third Party doctrine is dumb, and Carpenter probably didn't go far enough in limiting it.
Absolutely agreed about third party doctrine!
Wouldn't the most analogous situation be a canine search? The canine (machine) searches a location (datastore) based on instructions (training and commands) given by a human. The canine then alerts on specific areas (records) of the location (datastore), and those areas are further searched by the human (records are returned to the human for further review).
It seems *at least* similar enough for existing bodies of case law and argument to be relevant to the discussion.
If one wants to make the VC a platform for airing one's individual grievances, would not the thrice-weekly Open Threads be a better place?