Linguistic corpora and the evolving study of language and meaning

The Volokh Conspiracy

Mostly law professors | Sometimes contrarian | Often libertarian | Always independent

In the introductory chapter to their excellent corpus linguistics textbook, "The Routledge Handbook of Corpus Linguistics," Anne O'Keeffe and Michael McCarthy give a brief account of Cardinal Hugo of St. Caro, who is credited with creating the first concordance of the Vulgate Bible. (A concordance is an index of every occurrence of a word in a given text, often including the word's textual environment.) Recognizing the scope of his task and lacking any modern form of data processing (it was the year 1230), Cardinal Hugo assembled the next best thing - a team of 500 Dominican monks - to help him complete his concordance.

For the past several years, the two of us have taught a course on Law & Corpus Linguistics at the BYU Law School, together with the law school's dean, Gordon Smith. As a timed classroom exercise, we have each class member assemble a corpus of the Federalist Papers using Laurence Anthony's excellent (and free) AntConc corpus software. The entire class downloads the software and the texts, and within minutes each student has concordancing capabilities that would make Cardinal Hugo's head spin. The whole thing tends to take less than 10 minutes (and requires zero monks).

It is no accident that this course was first offered at BYU Law. BYU is the home of a number of the premier corpora of American English, including the Corpus of Contemporary American English (COCA), the Corpus of Historical American English (COHA), and the News on the Web Corpus (the NOW Corpus), as well as the architect of these corpora, linguist Mark Davies. These corpora are freely available to the public. BYU is also currently assembling a Corpus of Founding Era American English (COFEA).

The COCA and NOW Corpus are monitor corpora, a type of corpus that is continuously updated with new texts in order to reflect contemporary usage. The COHA, as its name implies, is a historical corpus that facilitates the study of language at a given point in history. All of the BYU Corpora are tagged corpora - they contain metadata from a grammatical "tagging" program that automatically marks each word with a part of speech. A tagged corpus allows a researcher to look for all forms of a single word in a single search (e.g., a search for the verb "carry" would automatically include the inflections "carries," "carrying," and "carried") and to restrict results to a particular part of speech (e.g., a search for the verb "harbor," not the noun "harbor").
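To make the idea of a tagged search concrete, here is a minimal Python sketch. It is not how the BYU corpora are implemented; the `Token` record, the toy tagset, the hand-tagged sample and the `search` helper are all assumptions made for illustration. It shows only why storing a headword and a part-of-speech tag with each word lets one query return every inflection of "carry" and separate the verb "harbor" from the noun.

```python
from typing import List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    word: str   # surface form as it appears in the text
    lemma: str  # headword the form belongs to
    pos: str    # part-of-speech tag (toy tagset, assumed for this sketch)


# Hand-tagged toy data. A real tagged corpus gets these tags from a tagging program.
TOKENS: List[Token] = [
    Token("She", "she", "PRON"),
    Token("carries", "carry", "VERB"),
    Token("a", "a", "DET"),
    Token("pistol", "pistol", "NOUN"),
    Token("he", "he", "PRON"),
    Token("carried", "carry", "VERB"),
    Token("one", "one", "PRON"),
    Token("both", "both", "PRON"),
    Token("are", "be", "VERB"),
    Token("carrying", "carry", "VERB"),
    Token("knives", "knife", "NOUN"),
    Token("They", "they", "PRON"),
    Token("harbor", "harbor", "VERB"),
    Token("fugitives", "fugitive", "NOUN"),
    Token("near", "near", "ADP"),
    Token("the", "the", "DET"),
    Token("harbor", "harbor", "NOUN"),
]


def search(tokens: List[Token], lemma: str,
           pos: Optional[str] = None) -> List[Tuple[int, str]]:
    """Return (position, surface form) for each token with this lemma.

    If pos is given, keep only tokens carrying that part-of-speech tag.
    """
    return [
        (i, t.word)
        for i, t in enumerate(tokens)
        if t.lemma == lemma and (pos is None or t.pos == pos)
    ]


if __name__ == "__main__":
    # One query on the headword finds every inflected form of the verb.
    print(search(TOKENS, "carry", "VERB"))
    # The tag separates the verb "harbor" from the noun "harbor".
    print(search(TOKENS, "harbor", "VERB"))
    print(search(TOKENS, "harbor", "NOUN"))
```

Without the tags, a search for the string "harbor" would return both uses, and a search for "carry" would miss "carried" entirely.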

The ability to create targeted, bespoke corpora and the availability of a wide range of professionally designed corpora have changed the way that many linguists and lexicographers go about studying language. We argued in a post Tuesday that a complete theory of ordinary meaning requires us to take into account the comparative frequency of different senses of words, the (syntactic, semantic and pragmatic) context of an utterance, its historical usage and the speech community in which it was uttered. Corpus tools can be used to measure ordinary meaning as conceptualized here.

Linguistic corpora can perform a variety of tasks that cannot be performed by human linguistic intuition alone. In our article, we demonstrate how each of these tools can be brought to bear on difficult cases dealing with questions of ordinary meaning - cases such as Muscarello v. United States, in which the Supreme Court had to determine the ordinary meaning of the phrase "carries a firearm."

Corpora can be used to measure the statistical frequency of words and word senses in a given speech community or register and over a given time period. Whether we regard the ordinary meaning of a given word to be a possible, a common, or the most common sense of that word in a given context, linguistic corpora can allow us to determine empirically where a contested sense of a term falls on that continuum.

Corpora can also show collocation, which is "a sequence of words or terms that co-occur more often than would be expected by chance." Words are often interpreted according to the semantic environment in which they are found. And a collocation program shows the possible range of linguistic contexts in which a word typically appears.
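As a rough illustration of "more often than would be expected by chance," the Python sketch below counts the words that fall within two positions of a node word and compares each count with what the word's overall frequency alone would predict. The three toy sentences, the `collocates` function and the simple observed-to-expected ratio are assumptions chosen for brevity; corpus software typically reports more refined association measures.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Toy text, invented for this sketch and already lowercased and tokenized.
TEXT = (
    "he carried a gun in his coat . "
    "she carried a gun to the door . "
    "the truck carried lumber to the mill ."
)
WORDS: List[str] = TEXT.split()


def collocates(tokens: List[str], node: str,
               window: int = 2) -> Dict[str, Tuple[int, float]]:
    """Map each neighbor of `node` to (observed count, observed/expected ratio).

    "Observed" is how often the word appears within `window` positions of the
    node. "Expected" is how often it would appear in that many positions if
    words were spread evenly through the text.
    """
    totals = Counter(tokens)
    near: Counter = Counter()
    slots = 0  # number of neighboring positions examined
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                near[tokens[j]] += 1
                slots += 1
    scores = {}
    for word, observed in near.items():
        expected = totals[word] / len(tokens) * slots
        scores[word] = (observed, observed / expected)
    return scores


if __name__ == "__main__":
    ranked = sorted(collocates(WORDS, "carried").items(),
                    key=lambda item: item[1][1], reverse=True)
    for word, (observed, ratio) in ranked:
        print(f"{word:>8}  observed={observed}  ratio={ratio:.2f}")
```

A ratio above 1 means the word turns up near "carried" more often than its overall frequency would suggest; in this toy text "gun" clears that bar while "the" does not.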

Corpora also have a concordance or key word in context ("KWIC") function, which allows their users to review a particular word or phrase in hundreds of contexts, all on the same page of running text.
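A KWIC display is simple enough to sketch in a few lines of Python. The version below is a bare-bones illustration, not the implementation used by AntConc or the BYU corpora; the `kwic` function, its `width` parameter and the toy text are assumptions. It lines up every occurrence of the search word with a few words of context on either side.

```python
from typing import List, Tuple

# Toy text, invented for this sketch.
SAMPLE = (
    "He carried a gun in his coat . "
    "She carried a gun to the door . "
    "The truck carried lumber to the mill ."
).split()


def kwic(tokens: List[str], node: str, width: int = 3) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) for each hit on `node`.

    Matching ignores case; `width` is the number of words of context kept
    on each side of the keyword.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


if __name__ == "__main__":
    # Right-align the left context so the keyword forms a single column.
    for left, keyword, right in kwic(SAMPLE, "carried"):
        print(f"{left:>20}  {keyword}  {right}")
```

Stacking the keyword in one column is what makes it practical to scan hundreds of uses at a glance.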

Linguistic corpora can be built from the ground up to represent the language use of a wide variety of speech communities or registers. As Lawrence Solan has noted, the choice among speech communities is "made tacitly in legal analysis, but becomes overt when the analysis involves linguistic corpora because the software displays the issue on a screen in front of the researcher."

One possibility worth highlighting is that of a distinct legal corpus. Some of the language of the law, of course, is written in a distinct legal dialect. Where a given term is thought to be a legal term of art, a legal corpus could be built to analyze its meaning in the legal vernacular. And such a corpus could be employed to compare the ordinary sense of a given term and its legal term-of-art usage.

Finally, a linguistic corpus can be built from texts representing the language use from any period in history for which there are surviving texts. To the extent our understanding of ordinary meaning should be informed by the linguistic norms and conventions prevailing at the time that a given legal text was drafted, corpus linguistics can provide powerful evidence of historic language use.

To address the question of the meaning of "carries a firearm" in Muscarello, we looked at the statistical frequency of the competing senses of "carry" in a context similar to that of the statute. We used the COHA to search for usage samples from the decade in which the statute was enacted. We used the collocation function of the corpus to better understand the textual environment in which "carry" tends to occur. We looked for sentences with similar syntactic and semantic contexts to those of the statute in question - sentences in which the verb "carry" has a human agent performing the carrying and a weapon object ("firearm" or one of its synonyms) being carried. We looked for such instances in what we have argued is the relevant speech community and register for the interpretation of generally applicable federal statutes. To the extent that we view the question of ordinary meaning as involving the statistical frequency of a word in a given context, the analysis above tells us that "carry on one's person" is overwhelmingly the most common use of "carry" in this context.

We do not mean to suggest that this sort of analysis is perfect (and in a later post we will address a number of potential shortcomings). We do mean to suggest that such an analysis is superior to - and much more transparent than - the current practice of relying on dictionaries and judicial intuition to resolve questions of ordinary meaning.
