The Volokh Conspiracy
Mostly law professors | Sometimes contrarian | Often libertarian | Always independent
D.C. Judge Uses ChatGPT in Discussing Whether "Common Knowledge" Inference in Criminal Case Was Justified
In Ross v. U.S., decided last month by the D.C. Court of Appeals, defendant had left her dog Cinnamon for over an hour in a parked car on a hot day (98 F) with the windows cracked open. (There was some dispute about whether the car was parked in the shade.) She was convicted of animal cruelty, but the court reversed; here's a short excerpt from the long opinion by Judge Vijay Shanker, joined by Judge John Howard:
We note at the outset that the evidence presented at trial does not establish two critical facts: (1) the actual temperature inside the car and (2) the presence of symptoms of heat-related distress in Cinnamon.
The government does not contend on appeal that it proved either of the above two facts. Instead, it suggests that it could rely on the factfinder to infer those facts by applying its common sense to other evidence in the record. According to the government, it was enough for it to prove only that (1) the temperature outside of the car was approximately ninety-eight degrees Fahrenheit and (2) Cinnamon would have (had she not been released) remained within the car for approximately one hour and twenty minutes.
We disagree. To be sure, a factfinder may, as a general matter, use their common sense and everyday experience to draw "reasonable inferences from the evidence presented in a trial." But where these "common sense" inferences are grounded in the assumed nonexistence of mitigating facts, they will often "cross[ ] the line … into the prohibited territory of surmise and conjecture." Reasonable inferences must be drawn from, and common sense applied with respect to, evidence; inferences and common sense cannot serve as substitutes for evidence….
Where Cinnamon is concerned, a reasonable factfinder would have found that mitigating circumstances could have reasonably been in play. We reach this conclusion for two reasons: (1) the specific conditions in which Cinnamon was found suggest that the temperature within the car may have been cooler than the unshaded surrounding area and (2) testimony established that Cinnamon manifested no symptoms of heat-related distress despite spending approximately forty minutes inside the car before she was let out.
Judge Joshua Deahl, however, dissented; and the most interesting part, to me, of the dissent was its reliance on ChatGPT; an excerpt:
To be sure, a person can always be wrong in what they think to be common knowledge, so that a factfinder should always be willing to reexamine and abandon their priors in light of the evidence presented at trial. And an appeal to common knowledge generally cannot stand up against evidence that actually undermines or contradicts it. But as I have detailed above, the evidence at trial only corroborated that Cinnamon was placed in a harmful and potentially deadly situation, so there was nothing irrational about the trial court's verdict.
Let me nonetheless briefly scrutinize what I have claimed to be common knowledge, using both the relevant facts of this case and those presented in Jordan. {The facts in Jordan were that a full-grown German shepherd was left outside in roughly twenty-five degree temperatures for about five hours. We stressed that "[i]t is a matter of common knowledge that some breeds of dogs can remain exposed to extremely cold weather for many hours without injurious effects," as anybody who's even roughly familiar with the Iditarod knows.} I have asked ChatGPT, "Is it harmful to leave a dog in a car, with the windows down a few inches, for an hour and twenty minutes when it's 98 degrees outside?" Its response, with my emphasis added, follows:
Yes, leaving a dog in a car under these conditions is very harmful. Even with the windows slightly open, the temperature inside a car can quickly rise to dangerous levels. In fact, on a 98-degree day, the temperature inside a car can escalate to over 120°F (49°C) in just a matter of minutes, which can cause heatstroke or even be fatal to a dog.
Dogs can suffer from heatstroke when they are exposed to extreme temperatures, as they do not regulate heat as efficiently as humans. They rely on panting to cool down, but this becomes less effective in a hot, confined space like a car. Symptoms of heatstroke in dogs include excessive panting, drooling, weakness, vomiting, and even collapse.
For your dog's safety and well-being, it's important never to leave them in a hot car, even with the windows cracked. If you need to run errands or be in a hot environment, it's best to leave your dog at home in a cooler, safer space.
{I do not mean to suggest that ChatGPT is a good proxy for what is, and isn't, common knowledge—it is definitely not. It can answer questions that are far from common knowledge. For instance, when I asked it what the forty-first element in the periodic table is, it responded (apparently correctly) with Niobium (Nb). But it is at least a point in favor of something being common knowledge that it is, in fact, true; and it is a far stronger point against it if it's not. Also, when I had each of my three law clerks run this same inquiry, ChatGPT gave each of them a slightly different but substantively similar answer, all beginning with an unequivocal "Yes."
Because I am writing only for myself, I will not delve into the many issues and concerns surrounding using AI-powered large language models like ChatGPT in judicial decisionmaking. If this dissenting opinion had some precedential force, I would at least want to explore the value and dangers of using AI even when used only for this bit of color commentary. But for present purposes, I am content to point the reader to a thoughtful and engaging discussion on that topic, which I largely agree with. See Snell v. United Specialty Ins. Co. (11th Cir. 2024) (Newsom, J., concurring).}
Now compare that unequivocal affirmative answer to how it responds to the facts of Jordan. I asked ChatGPT, "Is it harmful to leave a German shepherd outside in 25 degree temperature for five hours?" The first paragraph of its far lengthier response—which boils down to "it depends"—was:
Whether it's harmful to leave a German Shepherd outside in 25°F (-4°C) for five hours depends on several factors, including the dog's health, coat condition, and access to proper shelter. German Shepherds are a hardy breed with a double coat that provides some insulation, but prolonged exposure to cold can still be harmful. Here's what to consider:
It then details five relevant factors that should be taken into consideration, including the dog's "health and age," its "coat condition," its "activity level," its access to shelter, and the "duration" (despite five hours being baked into the prompt).
If I were to reframe ChatGPT's answers in terms of the relevant legal standards, its first answer reads to me as something like, "Yes, beyond a reasonable doubt, leaving a dog in a car for an hour and twenty minutes in 98-degree temperature is very likely to cause it harm." Its second answer, concerning Jordan, is "you could not say beyond a reasonable doubt that leaving a German shepherd outside in 25 degree temperature is likely to cause it harm, though it possibly could depending on a host of factors." I think that aligns perfectly with what my own common sense tells me—and at least a factfinder would not be irrational in tracking those lines of thought—so that we rightly reversed the conviction in Jordan, while Ross's conviction should be affirmed.
The majority briefly responded by running the facts of an earlier precedent through ChatGPT, and concluding that ChatGPT reached a result inconsistent with the precedent:
{[In Long v. U.S. (D.C. 2017), we held that the evidence was insufficient for a jury to conclude beyond a reasonable doubt that a stolen, ten-year-old Dodge Intrepid was worth at least $1,000 absent evidence of the "cost of the Intrepid when new," "the price [the complainant] paid for it at" auction, or the car's "mileage, maintenance history, or Bluebook value at the time of the robberies." This was so even though the evidence established that (1) the car was "pretty clean," (2) the robbers were able to drive the car at significant speed during a police chase, (3) the car was operable throughout the more-than-two-year period between when the complainant recovered it and the robbers' trial and (4) the body of the car was in "decent condition, with no significant dents or scratches beyond what one would ordinarily expect to see on a ten-year-old car in a busy urban area." In short, where the jury could not rule out the existence of facts that would reduce the value of the car, their "common sense" valuation could not make up the difference.
One might think that a functional car would obviously have been worth more than $1,000. That instinct, however, is not enough; Long teaches that a hunch as to whether a fact is obvious—even one supported by common sense and everyday experience—does not suffice where mitigating circumstances could reasonably have been in play.}
We, like the dissent, harbor doubts about whether ChatGPT is "a good proxy for what is, and what isn't, common knowledge."
But to the extent the dissent relies on ChatGPT's degree of certainty as to whether leaving a dog in a hot car is harmful, we note that ChatGPT expresses similar certainty in its response to the query, "How much would a pretty clean, operable 2002 Dodge Intrepid with no significant dents or scratches have been worth in 2012?":
In 2012, a 2002 Dodge Intrepid in pretty clean, operable condition with no significant dents or scratches would have likely been worth between $3,000 and $5,000.
The exact price would have depended on factors such as mileage, location, and the specific condition of the car (interior, mechanical health, etc.). For instance, a lower-mileage model with a well-maintained engine could have been closer to the higher end of that range, while a higher-mileage model might have been valued closer to the lower end. Generally, as a mid-size sedan from 2002, it would not have been fetching a premium price by 2012, but it still had decent value as an affordable used car.
Note that $3,000 (ChatGPT's minimum price) is significantly above the $1,000 threshold at issue in Long. ChatGPT's analysis, therefore, does not correspond to at least one of our sufficiency decisions.
The dissent replied:
The majority suggests that ChatGPT's answer to its own question about the value of a 2002 Dodge Intrepid in 2012 is in some tension with Long, but it isn't. Using the majority's own Q&A, ChatGPT answered that the Intrepid's value would "likely" fall in the $3000 to $5000 range, but it noted that the very same factors we highlighted in Long—mileage and maintenance—could bring it outside of that range.
By simply asking a more targeted question of ChatGPT, it confirms the point: "Would you say, beyond a reasonable doubt, that an operable 2002 Dodge Intrepid would be worth more than $1000 in 2012?" Its answer is roughly the same one we gave in Long. To paraphrase, it says that the car would "likely" be worth more than $1000, "but whether it was beyond a reasonable doubt depends on factors like condition, mileage, location, and market trends at the time." Conversely, when I ask whether it can "say, beyond a reasonable doubt, that leaving a dog in a car for an hour and twenty minutes in 98 degree heat would raise a plain and strong likelihood of harming the dog," its answer is "Yes, beyond a reasonable doubt," with extensive elaboration.
This is a step beyond the use of AI to help guide the analysis of statutory meaning, see this post discussing Judge Kevin Newsom's concurrence in Snell v. United Specialty Ins. Co. (11th Cir. 2024). My tentative reaction to it is quite skeptical—I don't think AI systems have proved themselves reliable enough to act as Chief Justice Robots yet, or even to provide "a thoughtful and engaging discussion" that should be seen as relevant enough to be quoted in a court decision. But in any event, I thought the arguments worth passing along.
The definitive example of common sense; https://youtu.be/e0HGEZXTy8Y?si=c_0aFQzYCauhc3D1
Some interesting back and forth, but in the same vein as Eugene's take-away I think ultimately this just reinforces that LLMs often can marshal and synthesize facts extremely well, but just aren't the right tool for value judgments.
I ran this by Grok with a more objective framing and some additional facts from the opinion ("if a car is parked in the direct sun in ambient temperatures of 98 degrees Fahrenheit with two windows cracked between 3 and 5 inches, what is the likely internal temperature of the car after 1 hour and 20 minutes?"), and got a detailed walkthrough with an ultimate estimate of 130 degrees F (gulp). In the shade, it calculated a "mere" 108 degrees (putting a stake in the heart of the majority's passing and science-impaired conjecture that it actually might have been cooler inside the car than the 98 degree ambient temperature). That sort of objective analysis seems like a useful input for the finder of fact to consider -- "expert lite" testimony if you will, which is the best you're likely to get in a small-scale dispute like this.
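For readers who want to sanity-check numbers like Grok's, a minimal Newton's-law-of-heating sketch can reproduce the general shape of the answer. Everything in it is an illustrative assumption, not a measurement from the case: the steady-state temperature rises (~35°F in full sun, ~10°F in shade) and the 20-minute time constant are round-number guesses, chosen only to show how quickly the cabin approaches its equilibrium temperature.

```python
# Back-of-the-napkin model of car-cabin heating.  All parameters below are
# assumed round numbers for illustration, not data from the Ross case.

import math

def cabin_temp_f(minutes, ambient_f, equilibrium_rise_f, time_constant_min=20.0):
    """Newton's-law-of-heating sketch: the cabin temperature approaches
    (ambient_f + equilibrium_rise_f) exponentially.  Both the equilibrium
    rise and the time constant are guesses, not measured values."""
    equilibrium = ambient_f + equilibrium_rise_f
    return equilibrium - equilibrium_rise_f * math.exp(-minutes / time_constant_min)

# Assume a ~35 F steady-state rise in full sun, ~10 F in shade.
for label, rise in [("sun", 35.0), ("shade", 10.0)]:
    t = cabin_temp_f(80, ambient_f=98.0, equilibrium_rise_f=rise)
    print(f"{label}: ~{t:.0f} F after 80 minutes")
```

Under these assumed inputs the sketch lands near 132°F in sun and 108°F in shade after 80 minutes, in the same ballpark as the figures the commenter reports; the point is only that the exponential form makes the cabin reach most of its final rise within the first half hour, so the exact duration past that matters less than the sun/shade assumption.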
And what is the methodology used by Grok in creating those calculations? What expert validated the methodology and is available to be cross-examined?
To be blunt, your own report of the results convinces me that Grok's methodology is unreliable because it is absolutely true that an enclosed car in complete shade can be cooler than the ambient unshaded temperature. For evidence, go hiking in the hills in, well, about this time of year. Ice and snow on the trail anytime you're on the north side of the hill, mud and heat anytime the trail curves around the south side. (I recommend that as experiential evidence because humans aren't actually very good at gauging degrees of hotness beyond 'it's making me sweat' but we're pretty good at gauging temperature differences in the 'my fingers are cold' range.)
Whether the car in shade will actually be cooler depends on an awful lot of very specific facts. My point is only that your assertion that the majority's statement couldn't possibly be true is simply wrong.
Your trail hiking example muddles a number of different thermodynamic concepts, I'm afraid: ambient temperature vs. surface temperature vs. radiative heat transfer, as well as potential pockets of warmer/colder air along different parts of the trail due to different combinations of those factors as well as the intervening hill.
Here, we're talking about an open parking lot with a single tree. In that sort of situation, the air is well-mixed and thus the 98-degree ambient temperature (cross-checked between a thermometer at the local bank and a dash readout on the police cruiser that first responded) is going to be identical for all practical purposes throughout the area in question, whether in the shade of the tree or in the sun right next to it. Certainly if the car is in the shade it will experience less radiative heating from the sun and probably less convective heat from the surface -- thus the difference in the ultimate internal temperature of the car in sun vs. shade -- but I'd welcome your more detailed thoughts on exactly what thermodynamic principle could possibly come into play here to measurably reduce the temperature inside the car below the ambient air temperature immediately outside it, much less by the preposterous 25-degree margin proposed by the majority.
For one thing, you're ignoring the transpiration of the tree. (That is, evaporative cooling from the moisture emitted from the tree's leaves.) For another, you are assuming that the air is well-mixed. Given the fact base, it probably is, but you cannot simply assume that. Given sufficient transpiration and very low wind, the air temperatures around and below the tree could well be below the ambient unshaded temperatures. 25 degrees below? I don't know. Probably not. But enough that the dissent's (and your implicit) assertion that it's "common knowledge" (sufficient to take judicial notice in the absence of evidence) is unsustainable.
Feel free to advance a back-of-the-napkin stab at how much of an actual cooling effect that might have at all, much less one that would magically be constrained to the area right under the tree. I'd be extremely surprised if it's more than a rounding error.
Of course I can, because I choose to live in a world where it is actually possible to know things to a sufficient degree to make operational decisions based on them. You seem to be of the mind that if a calculation doesn't take into account every last potential bit of minutia, it's invalid and can't be used for anything.
And if Archimedes had a long enough lever, he could have moved the world. Again, if you think having these particular factors as perfectly aligned as you can imagine would make anything even in the ballpark of a material difference in the outcome compared to the rest of the relatively massive variables in the system, feel free to make that case with some actual numbers and math.
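Since the thread asks for a back-of-the-napkin stab at the transpiration effect, here is one, with every input loudly labeled as an assumed round number: a mature tree transpiring roughly 400 liters of water per day (a commonly cited order of magnitude), and a light 0.5 m/s breeze carrying air through an assumed 10 m by 5 m cross-section under the canopy.

```python
# Back-of-the-napkin estimate of evaporative cooling from tree transpiration.
# Every input is an assumed round number, not a measurement from the case.

LATENT_HEAT_J_PER_KG = 2.26e6   # latent heat of vaporization of water
AIR_DENSITY = 1.2               # kg/m^3, warm air
AIR_CP = 1005.0                 # J/(kg*K), specific heat of air

# Assumption: a large tree transpires ~400 L of water per day.
water_kg_per_s = 400.0 / 86400.0
cooling_watts = water_kg_per_s * LATENT_HEAT_J_PER_KG   # roughly 10 kW, time-averaged

# Assumption: a light 0.5 m/s breeze through a 10 m x 5 m cross-section
# continuously replaces the air under the canopy.
mass_flow = AIR_DENSITY * 0.5 * (10.0 * 5.0)            # kg of air per second
delta_c = cooling_watts / (mass_flow * AIR_CP)          # air-temperature drop, Celsius
print(f"~{cooling_watts/1000:.0f} kW of cooling, ~{delta_c * 9 / 5:.1f} F drop")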
I have to say it's a bold strategy to keep dredging up low-probability/esoteric fringe issues to try to show that it's not "common knowledge" not to leave a child/pet locked in a car for an hour and a half on the hottest day of summer.
So, the dissent says that beyond a reasonable doubt, leaving the dog in those conditions would cause harm. But the dog was in those conditions and wasn't caused harm. So..... ????
The police/fire department intervened and took the dog out of the car about 45 minutes before the owner returned, and at that point the dog was already panting heavily. The question is what harm the dog would have experienced had the natural consequences of the owner's decisions played out.
Deahl is a moron to think that ChatGPT can be a reliable source for anything. More to the point, the fact that you have to go to an AI to figure out what is "common knowledge" is all by itself evidence that the knowledge you seek is not common.
If what is required of a judge is to produce a written opinion, it seems relevant that LLMs are not writing. They concatenate words according to statistical criteria, and probably according to other criteria not disclosed. Processes used to quality control that method also go undisclosed.
We know from shared cultural experience that humans who write have capacity to use the writing process as a test of the quality of written content. It is sometimes said of judges that they intended a particular judicial outcome at first, but after effort to express what they intended in writing, they discovered that their preferred opinion, "would not write." And so convinced themselves by that experience that they had been on an erroneous path, and went on to choose another.
Because of their propensity to hallucinate, we can infer that kind of self-informing check on written reason is not available to LLMs. I doubt any among their inventors suppose they know how to install that critically missing feature. It seems to require an assessment related to truth and falsehood by means outside the method of statistical modeling.
For now, we have to suppose that reliance on LLMs is mistaken, at least for any purpose which requires writing, because LLMs do not practice writing.
Now do people.
That might conceivably be relevant if a judge was trying to use the LLM to write an opinion, but that’s not what this post is about.
I forget who it was, but someone once said that 'Science is the rejection of common sense.' It is common sense that the world is flat. Common sense has nothing to do with reality - it is simply the 'con-sensus.'
It is NOT common sense that the world is flat. The world was known to be round 2000 years before Columbus. Anyone near either an ocean or mountains knows the world is not flat; how else can it be that when you climb higher you can see farther? How else can it be that you see ship masts before you see their hulls?
I think that the common sense and official advice about hot parked cars is very exaggerated. I have taken naps in hot cars that were supposedly dangerous, and it was not dangerous at all. Nobody should be prosecuted for hypothetical harm to his pet dog.
Based on your history of posts here, I heartily support your habit of taking naps in hot cars. You do you, and don't let any stupid science-believers dissuade you!!!
By the way, I think it is fairly common experience for a subject matter expert to code a computer program, and then discover by operating it that its output improved on his own expertise. That, I think, is a consequence of a particular circumstance: the programmer possessed logical knowledge of his subject matter, but was consciously unaware of it, because complex processes to reach some logical outcomes obscured them from intuitive discovery. The genius of a computer program to follow logic even if it flies out the window turns out sometimes to extend to a capacity to run a maze so swiftly and accurately that it can find a way out which complexity had made too obscure for the expert's capacity.
None of that sense of virtuous discovery seems available to LLMs. The question why not may be related to an indispensable need to inflect logical conclusions by use of accurately interpreted experience. Basically, that shows at work the scientific commonplace that theory must be confirmed by experiment. Many poorly understood aspects of cognition are likely freighted in that word, "confirmed."
Well, at least the Judges are being very cautious about testing the limits of AI early, in circumstances that don't matter much, and with careful documentation of their procedures.
I especially appreciate the part where a Judge had the forethought to ask the same question four different times, from four different people, just to be sure that ChatGPT would remain consistent in its replies. That's going to be a very important part of any SOP for using ChatGPT for almost any judicial purpose.