The Volokh Conspiracy
Mostly law professors | Sometimes contrarian | Often libertarian | Always independent
Corpus Linguistics in the Supreme Court
I've long been interested in this subject, and was particularly pleased to have Justice Thomas Lee of the Utah Supreme Court and Stephen Mouritsen guest-blogging in 2017 about their groundbreaking work on the subject. The subject came up in yesterday's argument with regard to Prof. James Phillips' and Prof. Jesse Egbert's forthcoming article, A Corpus Linguistic Analysis of 'Foreign Tribunal', so I'm very glad to be able to pass along this item from Prof. Phillips:
Yesterday in oral argument in ZF Automotive US, Inc. v. Luxshare, LTD, the Supreme Court discussed a paper we recently wrote. In that paper we performed corpus linguistic analysis to see how the term "foreign tribunal" was used around the time it was inserted into the statutory provision at issue in the case. And a couple of the justices expressed uncertainty about relying on our findings.
Chief Justice Roberts conceded, "I don't quite know what to make of that. That's … something new. I mean, have we relied on that source before?" In response, counsel answered that the Court had used this type of methodology in a case called Muscarello, where the majority opinion surveyed the use of the verb "carry" in New York Times articles. To which the Chief Justice asked, "[H]ave I ever done that before?"
Counsel replied that the Chief Justice's opinion in AT&T likewise used this type of methodology. Justice Barrett then stated that "the Court has never used the Corpus Linguistics database before." She noted that two lower courts have—the Sixth Circuit and the Utah Supreme Court—but repeated that "this Court has not." And she described what the Court did in Muscarello and in the Chief Justice's opinion in AT&T as, in both cases, "a more informal survey."
We have several responses to this colloquy (and we note petitioners' counsel did a good job describing and defending corpus linguistics).
First, while it's true that the Chief Justice has never relied on one of the specific "sources" we did, and that the Court has not used "the Corpus Linguistics database before" in a majority opinion, we think that is somewhat like worrying about relying on briefing based on cases found in LexisNexis because the Court has always done its research in Westlaw. The underlying data in these databases of texts, or corpora, comes from some of the very documents the Court has looked to in the past. For example, the Corpus of Historical American English (COHA), hosted by Brigham Young University, includes articles from the New York Times—the very articles the Court was content to rely on in its more "informal" corpus linguistic analysis in Muscarello. We think a New York Times article can be instructive in the search for meaning when one searches a corpus and finds that article just as much as when one finds that article on the New York Times' website.
Second, and relatedly, since the Court has been willing to conduct "a more informal survey," a more rigorous one needn't be viewed with uncertainty. To think otherwise is akin to being comfortable asking a handful of one's neighbors whom they will vote for for President, and inferring from their answers who will win the election, while having serious doubts about a large, random, national sample of prospective voters.
The reality is, the Court has long sampled texts of ordinary and legal language use to try and get a sense of how a word or phrase was being used in a particular time period. From Justice Thomas looking at the Federalist Papers to Justice Ginsburg turning to poetry to Justice Kagan citing Dr. Seuss, justices have been willing to perform this type of methodology. We just think that the size and representativeness of the samples of texts looked to in the past make it difficult to confidently generalize to the group of people or type of language the Court is interested in, such as the language of ordinary people. Fortunately, corpus linguistics can provide us with the language samples (corpora) and empirical methods (corpus analysis) one needs to be confident about such generalizations.
What is more, we looked at five sources in our study. One was what Justice Barrett referred to as a "Corpus Linguistic database," COHA, but we got very little data from this corpus. While a second was a corpus—BYU Law School's Corpus of Supreme Court Opinions of the United States (COSCO-US)—it merely consists of Supreme Court opinions. The exact same search and the exact same analysis—where the search results were just read in context—could have been done in Westlaw. The other three databases were ones the Court has often searched in the past: Westlaw (for federal court opinions), HeinOnline's Core U.S. Journals (for law review articles), and HeinOnline's U.S. Code. So if the Court wants to ignore the small bit of our analysis from COHA because a majority opinion has never cited to COHA before, that's one thing. But we don't see why it would ignore the analysis from sources it regularly relies on. And our analysis from those other four sources familiar to the Court is sufficiently clear regarding the meaning and use of "foreign tribunal."
Further, while neither Chief Justice Roberts nor a majority opinion has ever cited to one of the corpora of language use or to scholarship that uses them, other individual members of the Court have. Justice Thomas used BYU Law School's Corpus of Founding-Era American English (COFEA) in his dissent in Carpenter v. United States and cited corpus-based scholarship in another case. And Justice Alito has twice cited to scholarly articles that rely on one of these corpora in separate opinions he's written.
Probably more than the specific source, there may be questions over the methodology. But as noted above, the Court has long been doing "informal" corpus linguistics without calling it that. And what we did could be replicated in chambers.
For instance, the Chief Justice could have one of his clerks look for the 100 uses of the term foreign tribunal by the Supreme Court in Westlaw just prior to the enactment of the statutory language in 1964. He could then have two clerks independently read each instance and determine whether the narrower, government-authority sense or the broader, private/non-government authority sense was being used. He could then compare how often the clerks agreed and what percentage of the uses fell under each sense.
It is no wonder, then, that corpus linguistics has been called "Westlaw on steroids." And the Chief Justice could repeat that same analysis in the U.S. Code, Supreme Court opinions, and law reviews. Additionally, the rest of our analysis can likewise be replicated from our instructions and appendices.
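The two-clerk exercise described above can be sketched in a few lines of Python. This is only an illustration, not the procedure or code used in the paper: the labels "gov" and "private", the ten sample codings, and the function names are all hypothetical, and Cohen's kappa is added here as one common way to discount agreement that two coders would reach by chance.

```python
from collections import Counter


def percent_agreement(coder_a, coder_b):
    """Share of instances on which the two coders assigned the same sense."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of uses")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)


def cohens_kappa(coder_a, coder_b):
    """Agreement between two coders, corrected for chance agreement."""
    n = len(coder_a)
    observed = percent_agreement(coder_a, coder_b)
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    # Chance agreement: probability both coders pick the same label at random,
    # given how often each coder used each label.
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


def sense_shares(labels):
    """Fraction of uses assigned to each sense by one coder."""
    counts = Counter(labels)
    return {sense: count / len(labels) for sense, count in counts.items()}


# Hypothetical codings of ten uses of "foreign tribunal" (invented data).
clerk_a = ["gov"] * 7 + ["private"] * 3
clerk_b = ["gov"] * 6 + ["private"] * 4

agreement = percent_agreement(clerk_a, clerk_b)
kappa = cohens_kappa(clerk_a, clerk_b)
shares_a = sense_shares(clerk_a)
```

With real codings in place of the invented lists, `agreement` and `kappa` answer how often the clerks agreed, and `sense_shares` gives the percentage of each sense that a given clerk found.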
Additionally, some kind of corpus linguistic analysis is much more common in the lower courts than just the two courts named by Justice Barrett, though we recognize she was not trying to provide an exhaustive list. So far we count about three dozen opinions in 22 distinct lower courts, including six U.S. Courts of Appeals, six district courts, and four state supreme courts.
Finally, respondent's counsel attacks our study. He says it's self-published, but it will be published by the Virginia Law Review Online, and this Court has previously cited articles on SSRN before they were officially published by a journal. He claims it was full of gaps, but never describes what those are. He says "it's inconsistent whether there were two or three coders," but it's not clear that matters, and it's actually clear that for each analysis two coders looked at the material (not always the same two coders).
And he claims that "all it ends up doing is establishing that the phrase didn't really have a meaning as of 1964. They only were able to come up with a couple of hundred usages ever." Both statements are false. As our study made clear, the phrase did have a dominant meaning as of 1964: the narrow, government-authority sense. And we only analyzed 259 uses because when we found hundreds in a specific corpus or database we sampled those uses closer in time to 1964—the year the term "foreign tribunal" was adopted in the statute in question. As we spelled out in the paper, we found thousands of uses of the term, but focused on the more chronologically relevant ones. We would note that 259 uses of "foreign tribunal" is far more than respondents (or petitioners) put forth, and far more than the Court traditionally relies on when it has performed more "informal survey[s]."
Corpus linguistics can do what dictionaries cannot—namely analyze words and phrases and show which meaning is probable in a given context. Just as the Court and the legal world moved on from paper copies of reporters to searching online databases of cases, we think the Court should supplement its dictionary use with corpus linguistic analysis, especially when presented by a party. The Court will then be able to have increased confidence that it's uncovering the ordinary (or legal) meaning of terms used in legal texts.
I certainly see this sort of analysis having merit, though I would also prefer that the justices do the search themselves (or have clerks do so), rather than rely on academic papers saying "this is what we found".
Let's see what critiques you get if you publish your methodology not in a law review, but instead in a journal of academic history.
Here is a hint about historical method. It does not help much with problems of historical context if you dragnet historical sources for text samples, and then put them before present-minded analysts for contextual insight. Analysts untrained historically cannot discern any meanings except those supplied by modern context. Modern contexts are all present-minded analysts have in their heads.
Also, the context interpretation problem is two-fold, a fact of which a present-minded analyst will typically be unaware. There is a remembering part, and a forgetting part.
The first aspect of the problem, the remembering part, is somewhat addressed by the corpus linguistics method. It brings back to view instances of past usage which may have disappeared during the interval between then and now, and serves them up as challenging reminders. That is somewhat like remembering accurately—although you might get an argument on that from many good historians.
The forgetting part is less intuitive, notably more troublesome, and unaddressed by corpus linguistics. The forgetting part has to do with de-constructing present-mindedness, to clear the way for accurate insight into antique context taken as an alternative, instead of as an analogy, to the present.
The cognitive furniture of everyone is equipped with materials afforded by lived linguistic experience. Lived linguistic experience, in turn, is a mish-mash of meanings derived from history—with recent personally remembered history predominating, and less-recent history diminishing in influence, in uneven proportion to its antiquity.
Putting aside other criticisms, one thing you can say about that is that the uneven character of lived historical context varies from person to person, depending on age, and its unevenness gradually trends toward analytical ineffectuality as the length of the historical interval increases. Eventually, no one is old enough to remember first-hand a too-distant past, or even second-hand from cultural inference.
Thus, for some time fixed in an increasingly distant past, the interval between then and now must be presumed to contain a predominant sample of all the contextual insight available to a present-minded person. That means a point has been reached when the contextual insight of a present-minded person is furnished all-but-entirely by occurrences which were without influence on the context prevalent during the historical era in question—because during the historical era in question essentially all of those occurrences lay in the unknowable future. The founders knew essentially nothing about language context in 2022, just as we can imagine essentially nothing about language context in 2250. We can be certain that almost every occurrence with power to contribute to language context in 2250 is yet to come—locked away from our insight in the unknowable future.
Thus for any date in long-past time, the context people used then does not connect at all with a future context—our context—formed all-but-entirely by events they could not imagine. And yet it is the latter context, the present-minded one, the one nobody in the past knew anything about, which a typical corpus linguistics analyst will bring to the analytical task. Under that set of limitations, almost nothing is possible, except misinterpretation. If insight into past context is to become logically possible, something must be done to forget present context, as a quarantine, to keep it out of an analysis in which it can play no legitimate part.
That is a big problem. It is the principal problem which academic historians encounter, as they attempt the daunting task of constructing by inference a forgotten passage of history, using as tools whatever survivals they can collate from historical records.
The historians' method differs from the method used by the present-minded corpus linguistics analyst. Where the latter brings to bear his modern contextual awareness—an awareness, remember, which must be utterly alien to the historical figures in question—the historian works instead with the entirety of the surviving historical context—which is to say all the survivals—which the corpus linguistics method must necessarily bypass almost completely, lest the method fail to deliver the simplicity of analysis desired by its operators.
The key to the historian's method is this—instead of using his own critiques to ascribe context to historical survivals, the historian makes the historical survivals critique each other. By that method the historian creates the necessary quarantine to separate the present from the past. It excludes systematically every unknown occurrence which lay in that era's unknowable future. It works because if the historian chooses them with attention to their relevance, none of the survivals used will be from a time future to the era in question. The unknowable future is thus ruled out of consideration.
That really is the only reasonable way to solve the historical context problem. To use survivals from the relevant past in their entirety, and to bring them to bear on the problem of context all at once, so the critique of evidence is done by the evidence itself.
It is not easy to imagine how corpus linguistics can be adapted to that kind of work. But perhaps it can be. The first attempts at many now-successful technologies look naive in retrospect.