The Volokh Conspiracy

Mostly law professors | Sometimes contrarian | Often libertarian | Always independent


Chief Justice Robots

What should it take for us to accept AIs as judges?


I have an unusually speculative article—more futurism than law as such—coming out in a few months in the Duke Law Journal, called Chief Justice Robots. I'd love to hear what people think. Here are the Introduction and the Conclusion; you can read the full article here:


How might artificial intelligence change judging? IBM's Watson can beat the top Jeopardy players in answering English-language factual questions. The Watson Debater project is aimed at creating a program that can construct short persuasive arguments. What would happen if an AI program could write legal briefs and judicial opinions?

To be sure, AI legal analysis is in its infancy; prognoses for it must be highly uncertain. Maybe there will never be an AI program that can write a persuasive legal argument of any complexity.

But it may still be interesting to conduct thought experiments, in the tradition of Alan Turing's famous speculation about artificial intelligence, about what might happen if such a program could be written. Say a program passes a Turing test, meaning that it can converse in a way indistinguishable from a human. Perhaps it can then converse—or even present an extended persuasive argument—in a way indistinguishable from the sort of human we call a "lawyer," and then perhaps in a way indistinguishable from a judge.

In this Article, I discuss in more detail such thought experiments and introduce four principles—perhaps obvious to many readers, but likely controversial to some—that should guide our thinking on this subject:

[1.] Evaluate the Result, Not the Process. When we're asking whether something is intelligent enough to do a certain task, the question shouldn't be whether we recognize its reasoning processes as intelligent in some inherent sense. Rather, it should be whether the outcome of those processes provides what we need.

If an entity performs medical diagnoses reliably enough, it's intelligent enough to be a good diagnostician, whether it is a human being or a computer. We might call it "intelligent," or we might not. But, one way or the other, we should use it. Likewise, if an entity writes judicial opinions well enough—more, shortly, on what "well" means here—it's intelligent enough to be a good AI judge. (Mere handing down of decisions, I expect, would not be enough. To be credible, AI judges, even more than other judges, would have to offer explanatory opinions and not just bottom-line results.)

This, of course, is reminiscent of the observation at the heart of the Turing Test: if a computer can reliably imitate the responses of a human—the quintessential thinking creature, in our experience—in a way that other humans cannot tell it apart from a human, the computer can reasonably be said to "think." Whatever goes on under the hood, thinking is as thinking does.

The same should be true for judging. If a system reliably yields opinions that we view as sound, we should accept it, without insisting on some predetermined structure for the process. Such a change would likely require amendments to the federal and state constitutions. But, if I am right, and if the technology passes the tests I describe, then such amendments could indeed be made.

[2.] Compare the Results to Results Reached by Humans. A practical way to evaluate results is the Modified John Henry Test, a competition in which a computer program is arrayed against, say, ten average performers in some field—medical diagnosis, translation, or what have you. All the performers would then be asked to execute, say, ten different tasks—for instance, the translation of ten different passages.

Sometimes this performance can be measured objectively. Often, it can't be, so we would need a panel of, say, ten human judges who are known to be expert in the subject—for example, experienced doctors or fluent speakers of the two languages involved in a translation. Those judges should evaluate everyone's performance without knowing which participant is a computer and which is human.

If the computer performs at least as well as the average performer, then the computer passes the Modified John Henry Test.[1] We can call it "intelligent" enough in its field. Or, more to the point, we can say that it is an adequate substitute for humans.[2]
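The test just described can be sketched as a simple scoring procedure. Everything here is an illustrative assumption, not something specified in the article: panelists blindly score each participant's work across the tasks, and the program passes if its mean score is at least the mean of the human participants' averages (the "Ordinary Schlub" threshold, not the best performer's).

```python
# A minimal sketch of the Modified John Henry Test as a scoring procedure.
# Identities are assumed to be hidden from the panel; each participant's
# list holds its pooled scores across tasks and panelists.

from statistics import mean

def john_henry_test(panel_scores, program_id):
    """panel_scores: {participant_id: [score, score, ...]}.
    Returns True if the program scores at least as well as the
    average human participant."""
    program_avg = mean(panel_scores[program_id])
    human_avgs = [mean(s) for pid, s in panel_scores.items()
                  if pid != program_id]
    # Pass criterion: match the *average* human performer, not the best one.
    return program_avg >= mean(human_avgs)

scores = {
    "program": [7, 8, 6, 7],   # mean 7.0
    "human_1": [6, 7, 5, 6],   # mean 6.0
    "human_2": [8, 9, 7, 8],   # mean 8.0
}
print(john_henry_test(scores, "program"))  # True: 7.0 >= mean(6.0, 8.0)
```

A stricter decision rule of the kind footnote [1] contemplates—majority or supermajority of panelists rather than a pooled average—would replace the single comparison with a per-panelist vote.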

I label the test the Modified John Henry Test because of what I call the Ordinary Schlub Criterion. As I noted above, a computer doesn't have to match the best of the best; it just has to match the performance of the average person whom we are considering replacing.

Self-driving cars, to offer an analogy, do not have to be perfect to be useful—they just have to match the quality of ordinary drivers, and we ordinary drivers don't set that high a bar. Likewise, translation software just has to match the quality of the typical translator who would be hired in its stead.[3] Indeed, over time we can expect self-driving cars and translation software to keep improving as the technology advances; the humans' average, on the other hand, is not likely to improve, or at least to improve as fast. But even without such constant improvement, once machine workers are as good as the average human workers, they will generally be good enough for the job.

Indeed, in the John Henry story, John Henry's challenge was practically pointless, though emotionally fulfilling. Even if John Henry hadn't laid down his hammer and died at the end, he would have just shown that a team of John Henrys would beat a team of steam drills. But precisely because John Henry was so unusually mighty, the railroad couldn't hire a team of workers like him. The railroad only needed something that was faster than the average team—or, more precisely, more cost effective than the average team.[4] Likewise for other technologies: to be superior, they merely need to beat the human average.

Now, in some contexts, the ordinary schlub may be not so schlubby. If you work for a large company with billions at stake in some deal, you might hire first-rate translators—expensive, but you can afford them. Before you replace those translators with computer programs, you would want to make sure that the program beats the average translator of the class that you hire. Likewise, prospective AI Supreme Court Justices should be measured against the quality of the average candidates for the job—generally experienced, respected appellate judges—rather than against the quality of the average candidate for state trial court.

Nonetheless, the principle is the same: the program needs to be better than the average of the relevant pool. It doesn't need to be perfect, because the humans it would replace aren't perfect. And because such a program is also likely to be much cheaper, quicker, and less subject to certain forms of bias, it promises to make the legal system not only more efficient but also fairer and more accessible to poor and middle-class litigants.

[3.] Use Persuasion as the Criterion for Comparison—for AI Judges as Well as for AI Brief-Writers. Of course, if there is a competition, we need to establish the criteria on which the competitors will be measured. Would we look at which judges' decisions are most rational? Wisest? Most compassionate?

I want to suggest a simple but encompassing criterion, at least for AI judges' judgment about law and about the application of law to fact: persuasion. This criterion is particularly apt when evaluating AI brief-writer lawyers. After all, when we hire a lawyer to write a brief, we want the lawyer to persuade—reasonableness, perceived wisdom, and appeals to compassion are effective only insofar as they persuade. But persuasion is also an apt criterion, I will argue, for those lawyers whom we call judges. (The test for evaluation of facts, though, whether by AI judges, AI judicial staff attorneys, or AI jurors, would be different; I discuss that in Part IV.)

If we can create an AI brief-writer that can persuade, we can create an AI judge that can (1) construct persuasive arguments that support the various possible results in the case, and then (2) choose from all those arguments the one that is most persuasive, and thus the result that can be most persuasively supported. And if the Modified John Henry Test evaluator panelists are persuaded by the argument for that result, that means they have concluded the result is correct. This connection between AI brief-writing and AI judging is likely the most controversial claim in the paper.
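The two-step reduction above can be made concrete in a few lines. The components stubbed here—`write_brief` and `persuasiveness`—are hypothetical stand-ins for an AI brief-writer and a panel's evaluation, neither of which exists today; only the selection logic (rule for the result whose best supporting argument persuades most) comes from the argument in the text.

```python
# A sketch of the claimed reduction of AI judging to AI brief-writing.
# `write_brief`, `persuasiveness`, and the toy scores are illustrative
# assumptions, not components described in the article.

def write_brief(case, outcome):
    # Hypothetical AI brief-writer: the strongest argument for this outcome.
    return f"Brief arguing for {outcome} in {case}"

def persuasiveness(brief):
    # Hypothetical score assigned by the evaluator panel (higher = better).
    return toy_scores[brief]

def ai_judge(case, possible_outcomes):
    # Step 1: construct the best available argument for each possible result.
    briefs = {o: write_brief(case, o) for o in possible_outcomes}
    # Step 2: rule for the result whose supporting argument persuades most.
    return max(briefs, key=lambda o: persuasiveness(briefs[o]))

toy_scores = {
    "Brief arguing for plaintiff in Smith v. Jones": 0.7,
    "Brief arguing for defendant in Smith v. Jones": 0.4,
}
print(ai_judge("Smith v. Jones", ["plaintiff", "defendant"]))  # plaintiff
```

The design choice worth noticing is that the judge adds nothing beyond the brief-writer except an argmax: if the brief-writer is trustworthy, the controversial step is only the selection rule.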

[4.] Promote AIs from First-Draft-Writers to Decisionmakers. My argument starts with projects that are less controversial than AI judges. I begin by talking about what should be a broadly accepted and early form of AI automation of the legal process: the use of AI interpreters to translate for non-English-speaking witnesses and parties. I then turn to AI brief-writing lawyers—software that is much harder to create, of course, but that should likewise be broadly accepted, if it works.

From there, I argue that AI judicial staff attorneys that draft proposed opinions for judges to review—as well as AI magistrate judges that write reports and recommendations rather than making final decisions—would be as legitimate and useful as other AI lawyers (again, assuming they work). I also discuss AIs that could help in judicial fact-finding, rather than just law application.

And these AI judicial staff attorneys and magistrates offer the foundation for the next step, which I call the AI Promotion: If we find that, for instance, AI staff attorneys consistently write draft opinions that persuade judges to adopt them, then it would make sense to let the AI make the decision itself—indeed, that can avoid some of the problems stemming from the human prejudices of human judges. I also discuss the possible AI prejudices of AI judges, and how they can be combatted.

Just as we may promote associates to partners, or some magistrate judges to district judges, when we conclude that their judgment is trustworthy enough, so we may promote AIs from assistants to decisionmakers. I also elaborate on the AI Promotion as to jurors, and finally move on to the title of this Article: AI judges as law developers.

Indeed, the heart of my assertion in this Article is this: the problem of creating an AI judge that we can use for legal decisions is not materially more complicated than the problem of creating an AI brief-writer that we can use to make legal arguments. The AI brief-writer may practically be extremely hard to create. But if it is created, there should be little conceptual reason to balk at applying the same technology to AI judges within the guidelines set forth below. Instead, our focus should be on practical concerns, especially about possible hacking of the AI judge programs, and possible exploitation of unexpected glitches in those programs; I discuss that in some detail in Part V.C.3.

This, of course, is likely to be a counterintuitive argument, so I try to take it in steps, starting with the least controversial uses of AI: courtroom interpreters (Part I), brief-writing lawyers (Part II), law clerks (Part III), and fact-finding assistants that advise judges on evaluating the facts, much as law clerks do as to the law (Part IV). Then I shift from assistants to actual AI judges (Part V), possible AI jurors (Part VI), and finally AI judges that develop the law rather than just applying it (Part VII); that is where I argue that it makes sense to actually give AIs decision-making authority. It would be a startling step, but, again assuming that the technology is adequate—and that we can avoid an intolerable level of security vulnerabilities—a sound one….


A man calls up his friend the engineer and says, "I have a fantastic idea—an engine that runs on water!" The engineer says, "That would be nice, but how would you build it?" "You're the engineer," the man says, "I'm the idea man."

I realize I may be the joke's "idea man," assuming away the design—even the feasibility of the design—of the hypothetical AI judge. Perhaps, as I mentioned up front, such an AI judge is simply impossible.

Or maybe the technology that will make it possible will so transfigure society that it will make the AI judge unnecessary or irrelevant. If, for instance, the path to the AI judge will first take us to Skynet, I doubt that John Connor will have much time to discuss AI judges—or that Skynet will have much need for them. Or maybe the technical developments that would allow AI judges will produce such vast social changes that they are beyond the speculation horizon, so that it is fruitless to guess about how we will feel about AI judges in such a radically altered world. And in any event, the heroes of the AI judge story will be the programmers, not the theorists analyzing whether Chief Justice Robots would be a good idea. [Footnote: As Sibelius supposedly said, no one has ever built a statue honoring a critic.]

Still, I hope that I have offered a way of thinking about AI judges, if we do want to think about them. My main argument has been that

  • We should focus on the quality of the proposed AI judge's product, not on the process that yields that product.
  • The quality should largely be measured using the metric of persuasiveness.
  • The normative question whether we ought to use AI judges should be seen as turning chiefly on the empirical question whether they reliably produce opinions that persuade the representatives that we have selected to evaluate those opinions.

If one day the programmers are ready with the software, we should be ready with a conceptual framework for evaluating that software.

[1] This doesn't require a unanimous judgment on the part of the panel; depending on how cautious we want to be, we might be satisfied with a majority judgment, a supermajority judgment, or some other decision rule.

[2] In some contexts, of course, automation may be better even if it's not as effective—for instance, it may be cheaper and thus more cost-effective. But if it's cheaper and at least as effective, then it would be pretty clearly superior.

[3] Carl Sagan observed that no computer program "is adequate for psychiatric use today [in 1975], but the same can be remarked about some human psychotherapists." The question is never whether a proposed computer solution is imperfect; it's whether it's good enough compared to the alternative.


It didn't matter if he won, if he lived, or if he'd run.

They changed the way his job was done. Labor costs were high.

That new machine was cheap as hell and only John would work as well,

So they left him laying where he fell the day John Henry died.

Drive-By Truckers, The Day John Henry Died (2004).