The Volokh Conspiracy
Mostly law professors | Sometimes contrarian | Often libertarian | Always independent
Opening a File After A Hash Was Made and Matched to Known Image of Child Pornography is Not a "Search," Fifth Circuit Rules
An interesting case applying the private search reconstruction doctrine.
The Fifth Circuit has handed down a fascinating computer search case in United States v. Reddick. Here's the question: If a private company runs a hash of a file and compares the hash to those of known images of child pornography, and it finds a match to a known image and forwards on the file to the government, is it a "search" for the government to then open the file to confirm it is child pornography? Held, per Judge James Ho: No, it is not a search under the private search reconstruction doctrine.
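To make the mechanics concrete, here is a minimal sketch of the kind of hash-matching workflow at issue, using Python's standard hashlib with SHA-256 as a stand-in. The opinion does not name the algorithm, and (as commenters note below) PhotoDNA is really a perceptual fingerprint rather than a cryptographic hash, so every name and value here is purely illustrative.

import hashlib

# Hypothetical database of hash values of known images; the digest below
# is a made-up placeholder, not a real entry.
KNOWN_BAD_HASHES = {
    "9f2feb0f1ef425b292f2f94bfbf4d09cb0e4a2371b4ff81962c2fd0f11fc77b0",
}

def file_sha256(path):
    """Hash a file in chunks so large files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def flag_if_known(path):
    """Return True if the file's hash matches a known image's hash."""
    return file_sha256(path) in KNOWN_BAD_HASHES

On this model, the legal question is whether anything more is "searched" when an agent opens a file that flag_if_known has already flagged.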
First, some background. The private search reconstruction doctrine lets the government recreate a private search as long as it doesn't exceed the private search. The idea is that the private search already frustrated any reasonable expectation of privacy. Merely recreating what the private party did is within the private search and is not a new government search. But in the case of computers, that raises difficult issues: What is merely a recreation of a prior private search, and what exceeds the search?
In Reddick, the Fifth Circuit holds that actually opening a file that had matched to a known image of child pornography was not a search because "the government effectively learned nothing from [the agent's] viewing of the files that it had not already learned from the private search." Here's the analysis:
When Reddick uploaded files to SkyDrive, Microsoft's PhotoDNA program automatically reviewed the hash values of those files and compared them against an existing database of known child pornography hash values. In other words, his "package" (that is, his set of computer files) was inspected and deemed suspicious by a private actor. Accordingly, whatever expectation of privacy Reddick might have had in the hash values of his files was frustrated by Microsoft's private search.
When Detective Ilse first received Reddick's files, he already knew that their hash values matched the hash values of child pornography images known to NCMEC. As our court has previously noted, hash value comparison "allows law enforcement to identify child pornography with almost absolute certainty," since hash values are "specific to the makeup of a particular image's data." United States v. Larman, 547 F. App'x 475, 477 (5th Cir. 2013) (unpublished). See also United States v. Sosa-Pintor, 2018 WL 3409657, at *1 (5th Cir. July 11, 2018) (unpublished) (describing a file's hash value as its "unique digital fingerprint").
Accordingly, when Detective Ilse opened the files, there was no "significant expansion of the search that had been conducted previously by a private party" sufficient to constitute "a separate search." Walter v. United States, 447 U.S. 649, 657 (1980). His visual review of the suspect images—a step which merely dispelled any residual doubt about the contents of the files—was akin to the government agents' decision to conduct chemical tests on the white powder in Jacobsen. "A chemical test that merely discloses whether or not a particular substance is cocaine does not compromise any legitimate interest in privacy." 466 U.S. at 123. This principle readily applies here—opening the file merely confirmed that the flagged file was indeed child pornography, as suspected. As in Jacobsen, "the suspicious nature of the material made it virtually certain that the substance tested was in fact contraband." Id. at 125.
Significantly, there is no allegation that Detective Ilse conducted a search of any of Mr. Reddick's files other than those flagged as child pornography. Contrast a Tenth Circuit decision authored by then-Judge Gorsuch. See United States v. Ackerman, 831 F.3d 1292 (10th Cir. 2016). In Ackerman, an investigator conducted a search of an email and three attachments whose hash values did not correspond to known child pornography images. 831 F.3d at 1306. The Tenth Circuit reversed the district court's denial of a motion to suppress accordingly. Id. at 1309. Here, by contrast, Detective Ilse reviewed only those files whose hash values corresponded to the hash values of known child pornography images, as ascertained by the PhotoDNA program. So his review did not sweep in any "(presumptively) private correspondence that could have contained much besides potential contraband." Id. at 1307.
Interesting case.
It seems to me that there are two different questions potentially at work here. One question is whether opening a file after a private party has run a hash on the file exceeds the scope of the private party search for any kind of file. A second question is whether there are special rules for opening images of child pornography under the contraband search cases of Jacobsen and Illinois v. Caballes. On my initial read, I see Reddick as more about the second question than the first.
With that said, I have to think more about whether Reddick is a persuasive application of those cases. Here's why I'm not sure. The key to the contraband search cases of Jacobsen and Caballes is that the field testing and dog sniffing revealed nothing other than the presence or absence of contraband. The drug field testing in Jacobsen either returned positive or negative. The well-trained drug-sniffing dog in Caballes either alerted to the presence of drugs or didn't. It was a binary situation in which the only information learned was the presence or absence of contraband.
When a government agent opens a file, though, is more learned than whether the image is child pornography? I gather the opener of the file sees the full image, and then, after seeing the image, makes a judgment about whether the file is child pornography. The ultimate goal is to confirm that the image is child pornography. But more is learned than that; it's arguably less like using a drug-sniffing dog to alert for drugs than it is actually opening the trunk of the car and seeing the drugs. That latter act would be a search, even if the goal is just to confirm that a dog's alert for drugs was correct and to actually find the contraband.
I suppose this hinges on what the baseline knowledge should be for opening a file. It's an interesting question. If it is known that a particular hash value corresponds with a particular known image, how do you model what is learned by opening a file that matched that hash? Do you say that the opener of the file already has the knowledge of what that particular image looks like, and that opening the file to see that it is that image really just confirms that it's a match and doesn't tell the agent anything else? Or do you model the agent's knowledge as just being that a file matched with some known image, and that opening the file thus gives the opener more information about what the file looks like? And in trying to answer that, do you consider just the individual opener's knowledge, or do you impose some sort of collective knowledge doctrine under which you consider the knowledge set of some broader group? I'm not sure.
It occurs to me that a related (but perhaps stronger) way for the court to have reached the same result would have been to rely on what some have called the single-purpose container doctrine. This doctrine goes back to a footnote in Arkansas v. Sanders, in which the Supreme Court stated that "some containers (for example a kit of burglar tools or a gun case), by their very nature, cannot support any reasonable expectation of privacy because their contents can be inferred from their outward appearance." In Robbins v. California, the Court explained that for this doctrine to apply, "a container must so clearly announce its contents, whether by its distinctive configuration, its transparency, or otherwise, that its contents are obvious to an observer."
It seems at least plausible that this could apply to opening a file with a known hash. If you know that a particular image has a particular hash, and you then have a file with that hash, then the information you have before you open the file "clearly announce[s] its contents . . . by its distinctive configuration" so that "its contents are obvious to an observer." The contents "can be inferred from [the file's] outward appearance," at least if you take "appearance" to include the hash value of the file. Notably, though, this approach would be broader than just child pornography. It would apply to opening any files with known hashes.
Finally, I gather that Reddick does not implicate the existing circuit split on how the private search reconstruction doctrine applies to computer searches. The existing split is on how to measure how much is "searched" when a private party accesses a computer: Does the private party access search the entire computer, or just the file, or the folder, or what was actually observed? In this case, however, there was apparently just one file at issue.
Anyway, it's a fascinating case. And it was a very well-written opinion from Judge Ho, I thought, at least after you ignore the extraneous citations to legal scholarship.
If opening the file after verifying the hash is not a search as it is just confirming the suspicion, then the creation/comparison of the hash is the initial search. Are the "private entities" performing this search acting as agents of the government or are they truly performing this independently?
Was addressed in the opinion, and truly private searches (entirely independent of government) of the content of computer storage and communications have long been common in private business.
Takeaway: Don't use SkyDrive. What else are they rummaging around in your image data for?
This is my biggest concern. With the current dominance of a small group of companies over communication, the idea that the feds could pressure them into performing routine surveillance that the police cannot without a warrant isn't unfounded.
If it was a government-mandated search of the hash, then I think that the defendant would have a reasonable case.
1. It's not clear, to me at least, why opening a file sent by a third party would be a search.
2. Hashes aren't necessarily unique. This might be a problem if there is no other reason to believe that the file is contraband.
Hashes are necessarily not unique.
Correct. Hashes are designed to map data of arbitrary size to a fixed size. It is not possible to map every possible value of greater size to the fixed size map without collisions. Secure hashes are designed to map in a way that makes it difficult to find a collision.
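A toy way to see the pigeonhole point (my own sketch, assuming nothing about any particular system): truncate a hash to 8 bits and collisions appear almost immediately, while at the full 256 bits the same brute-force search would be expected to run for roughly 2**128 attempts.

import hashlib

def tiny_hash(data):
    # First byte of SHA-256: a deliberately terrible 8-bit "hash".
    return hashlib.sha256(data).digest()[0]

seen = {}
i = 0
while True:
    msg = str(i).encode()
    h = tiny_hash(msg)
    if h in seen:
        print(f"collision after {i + 1} tries: {seen[h]!r} and {msg!r}")
        break
    seen[h] = msg
    i += 1
# With 256 buckets a collision typically shows up after ~20 tries
# (the birthday bound); with 2**256 buckets, never in practice.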
This is pedantic in the extreme. There are roughly 2**230 atoms in our galaxy. There are 2**256 possible outcomes to a 256-bit hash.
"Very difficult to find a collision" is the mathematical phrasing; in practice, we believe that SHA-256 collisions will never be found or computed even by a malicious adversary, let alone happen accidentally.
Nonzenze, just a question from someone without the mathematical resources to compute the answer. If you wanted to know how many pair-wise comparisons could be made from a given data set, what is the right method to do it? What do you suppose would be the number of photographic images online, and how would the possible comparisons derived from that number compare to your possible outcomes of a 256-bit hash?
For instance, assume 10 billion images online. Especially given frame by frame analysis of video, that number seems conservative, maybe extremely conservative. Is that pair-wise comparison number larger or smaller than the number of hash outcomes? How many images would have to be in the online data set before the pair-wise comparisons exceeded the possible hash outcomes?
Turns out I was mistaken on the size of the data set. Quaintly so, with regard to 10 billion images. It's many orders of magnitude greater. So that makes the question more pointed.
Let's reframe. Suppose you wanted to do a pair-wise comparison among internet files, to discover hash collisions, for every image stored on internet servers, including video frames as separate images. Isn't it self-evident that the computer power available to accomplish that would never be sufficient to keep up in real time even with ongoing daily increments, let alone cope with the entire internet archive?
So what are we to make of a story inviting an inference that some particular file was singled out as a result of a hash check, done at random? I suggest the only sensible conclusion is that it wasn't random: the file's owner was specifically targeted for a search before any hash evidence was available. Perhaps that is the part the legal analysis should focus on.
In saying that, I'm mindful that others are far better equipped with regard to the technical questions than I am. Please correct me if I need it.
Okay, so a moment's reflection shows the task is smaller, because you only need to compare all the stuff on the internet to known bad images, which are a tiny fraction of everything. Is there enough computer power to do that? Enough just to monitor the daily increment? Or should we still assume some kind of prior targeting would be the most likely explanation when a match is found?
It sounds like Microsoft uses PhotoDNA to scan all image files uploaded to SkyDrive (now called OneDrive), so I'm not sure you could call this 'targeted'.
Performing a single hash check is a pretty lightweight operation for most files. For small files, it's 10s to 100s of milliseconds. For very large files (gigabytes or more) it can take minutes.
If someone were to have access to the entire internet, then yes, they could perform this scan in a reasonable amount of time - given enough resources. Think Google taking every computer they have... and even then, it would take a few years. The US government could certainly do it (for those files they can access).
Scanning only new content is a little different. A service like OneDrive already has to process the file when it is uploaded, plus hashes are commonly used to detect errors in transfers anyway, so it is trivial for a scan of newly uploaded material to be performed.
This entire process is similar to looking for copyright infringing material - something that companies like YouTube do constantly. There may be some legal history there that could be applied here as well.
For a non-host to scan files is more difficult. NCMEC, for example, does not host files. So to scan anything they would need to first find and download the file - a step that requires major bandwidth, processor, and storage capacity. NCMEC does not have that capacity.
Compared to acquiring the files, performing the hash is trivial.
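The cost claims above are easy to ballpark on your own machine; a rough probe (the numbers are hardware-dependent and purely illustrative):

import hashlib
import os
import time

data = os.urandom(100 * 1024 * 1024)  # 100 MB of random bytes

start = time.perf_counter()
hashlib.sha256(data).hexdigest()
elapsed = time.perf_counter() - start

print(f"hashed 100 MB in {elapsed:.2f}s ({100 / elapsed:.0f} MB/s)")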
Just curious. When YouTube checks copyright infringement, do they do it on the basis of comparing image content? Or do they have some other way of finding a match, with metadata or something? I would be really impressed if they have the chops to compare recently submitted images (especially all the frames in a video) to some usefully complete database of copyrighted image material.
This is fingerprinting, an attempt to distill a unique value from the significant content rather than the representation.
Every time a webpage is opened, cryptographic operations far more expensive than a 256-bit hash are done (in fact, a dozen or so hashes are done, plus every packet is encrypted and signed and decrypted and verified).
The overhead of, e.g., scanning every file on Dropbox or OneDrive is trivial by comparison.
But, there is a major downside here -- a cryptographic hash is designed to change the ENTIRE OUTPUT if even a single bit of input changes. So H("Stephen Lathrop") and H("Stephan Lathrop") are entirely uncorrelated. A smart criminal can evade these sorts of checks by modifying a single insignificant bit.
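The avalanche effect described there is easy to demonstrate with SHA-256 (used here only as an example of a cryptographic hash, which PhotoDNA is not):

import hashlib

for name in (b"Stephen Lathrop", b"Stephan Lathrop"):
    print(name.decode(), hashlib.sha256(name).hexdigest())
# The inputs differ by one character, yet the two digests share no
# discernible structure; on average half of the 256 output bits flip.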
Good questions!
First, please see the posts downthread. The actual match here was emphatically not a "hash" but rather a fingerprint. The difference is quite significant, and I am actually quite upset at being bamboozled into applying the properties of the former to the latter.
[ And as for the properties of the fingerprint method, there is no detailed specification published, no vulnerability analysis or security proof offered -- flying in the face of all cryptographic practice. I would venture no claims on it whatsoever.
What follows then is purely about cryptographic hash functions, of which PhotoDNA is emphatically not. ]
If 10**B is the number of bad images and 10**N is the number of new images, then for an ideal 256-bit hash (each pair collides with probability 2**-256, or about 10**-77) the odds of an accidental match are
10**( B+N-77 ). So for example, if there are a billion bad images and a trillion new images, that's
B=9, N=12, P = 10**(-56)
If we generate a trillion new images per second, then we expect to wait about 10**48 years for a duplicate (the universe is 10**10 years old).
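The same arithmetic in runnable form, assuming ideal, uniformly random 256-bit hashes (each pair of files collides with probability 2**-256):

bad = 10**9         # B = 9: known bad images
new = 10**12        # N = 12: new images screened
p = bad * new / 2**256
print(f"expected accidental matches: {p:.1e}")   # ~1e-56

per_second = 10**12                       # new images per second
seconds = 2**256 / (bad * per_second)     # expected wait for one collision
print(f"expected wait: {seconds / 3.15e7:.1e} years")   # ~4e48 years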
Not all hash functions are cryptographic. I'm not aware of anyone using cryptographic-grade hash functions to do file checksums.
You're not aware that the most common source control software on the planet uses SHA-1 to identify records?
Not sure this was actually a 256 bit hash (doesn't seem to be stated in the case record, at least a search for "256" fails.)
Nevertheless, if it were a strict hash (not some sort of fuzzy match) then a byte-by-byte comparison of the file with a reference copy (easily indexed from the hash) would confirm or deny the match without any individual having to look at the actual image - but equally, making a trivial change (appending a single random byte to the file, or resaving with jpg compression) would cause the match to fail - which presumably the file uploader could have known.
Hash values aren't unique, but the probability of a collision depends on the number of bits, among other things. It could be significantly smaller, for example, than the probability that two people have the same DNA profile.
Well, it depends which hash algorithm. Anything from SHA-2 onward is not only collision-free in practice in the face of uncorrelated input, it stays collision-free even in the presence of a malicious adversary.
MD5 OTOH, not so much.
What is the significance of "uncorrelated input?" Do you suppose correlated (or partially correlated) input could result from standardized methods for processing digital photos?
For instance, if I make a series of slight color corrections step-wise, all proceeding from the same initial image, won't that deliver a highly correlated set of slightly different images? From a hash standpoint, how large would my changes have to be before any two images became uncorrelated? Wouldn't that depend on the hash algorithm, and also on the meaning of "slight" in my description of the process?
Likewise, what happens if my edits are crops or slight rotations of the image in the frame? How about contrast adjustments which either spread adjacent color values farther apart, or merge adjacent colors into a single color, never to be distinguished again?
If any of that has significance, does that suggest an unknowable amount of uncertainty with regard to a theoretical treatment of the hash process? Could a hash algorithm applied to a step-wise process of the sort I describe end up distinguishing as different two images which most optical observers would pronounce the same? Would all the considerations relating to a step-wise series of digital still-photo edits apply alike to any series of images produced by videography?
That depends on the hash. For a good hash, even a 1 bit change results in a completely different hash value.
Note however, that PhotoDNA, which seems to be what was used, involves more than just hashing, but those differences will likely make it much more susceptible to collisions.
Stephen, I think you mis-parsed my statement.
I said, a cryptographic hash function is resistant to collision even in the face of a malicious adversary. That is, even given a file F1 and its hash H(F1), it is infeasible to produce a second file F2 != F1 with H(F2) = H(F1) any faster than brute force.
It follows that it is likely resistant to collision even in the weaker case of correlated input due to similarities in processing.
Note also. With regard to the step-wise processes I described in my previous comment, it might be relevant to understand that the standard method for accomplishing them is not to produce a new image for each adjustment, but instead to preserve an initial image which remains unchanged, and store the adjustment instructions as text, while using those instructions to generate unique previews to facilitate judgment of the effects of each change.
But subsequently, a common practice is to choose one or more options from the previewed images, and generate from the unchanged original new images conforming to the original image as modified by the stored instructions. Those post-processed images then exist on a digital par with the original, and the code which created them is typically not stored with the image. The parent image can then be left behind, while the changed image goes abroad on the internet (or on to any other future) as a stand-alone. Thus, in the end, both related images exist together, but separately and independently.
> When a government agent opens a file, though, is more learned than whether the image is child pornography?
If the government knows the hash, if it's a secure hash, and if the government knows what the image is that matches that hash, then nothing is learned by opening the file. (Alternatively, all that's learned by opening the file is that hell hasn't frozen over.)
Although MD5 was designed as a cryptographic hash, it is no longer considered secure. It is now used as a checksum.
True, although it's not clear to me how the problems with MD5 could cause this particular use (if MD5 were actually used - it isn't) to become a search, as any collision between the user's file and a child porn image would not be accidental.
If it was MD5, an adversary with access to child porn could craft a document (e.g. a PDF) and massage it until the hash matched that of the child porn.
MD5 has long (2009) been broken. Relying on it to be collision resistant in 2018 is negligent at best.
"The private search reconstruction doctrine lets the government recreate a private search as long as it doesn't exceed the private search. The idea is that the private search already frustrated any reasonable expectation of privacy."
So the 4th amendment goes bye bye if the government just informally delegates the initial search?
This seems to me to be a paramount issue. Private individuals can turn up lots of information using means forbidden to the authorities. The question is whether they operate in a truly private capacity or as agents of the police.
I confess to being a complete newbie about both hashes and reasonable private-police searches. It doesn't help me that the story doesn't mention whether the police, being alerted, got a warrant for the "search." It seems not, as the argument is that there wasn't a "search" at all.
Depends what "informally delegate" means. If "informally delegate" is a type of delegation, then no. If "informally delegate" is a FUD term you made up to describe something other than delegation, then maybe.
"So the 4th amendment goes bye bye if the government just informally delegates the initial search?"
No. Private actors are treated as the government if they are acting on the government's behalf.
That's why the government is going to lie about it. Lies are a routine part of our legal system. See "parallel construction" if you doubt this.
It's more or less considered the same if you go into the police station to report a crime. Even if you gained the information or evidence in a manner that would be illegal as a search, the police can still use your information.
The biggest question is whether Microsoft is doing this search out of their own concern or whether this was a condition to not get broken up all those years ago.
Was it necessary to open and view the file to prosecute Reddick for child pornography, or was the hash sufficient? Profuse sweating, increased heart rate, respiration and pupillary contraction are not equivalent to a confession. When external sensors can detect at a distance, in a public area, brain activity consistent with child pornography imagery, is that sufficient to allow a warrantless seizure and search of computer files? Or would those brain waves be compelling evidence to convict?
It wasn't necessary to open and view the file. Comparing the bits with the previously known file would have been sufficient.
Even without that, it would have been enough to prosecute.
If the prosecutor had waited until the day of trial to open and view the file, would that have been a search right then and there in the courtroom?
Sorry but that's not true, AJD. No jury would (or at least, no jury should) convict just based on a hash match. The reason requires some math. (Note that the specific math depends on which hash algorithm is used. For simplicity, I'm going to use MD5.)
While it is true that the probability of any two files randomly matching in their hashes is 0.5^128 (a very small number), they are not just matching two particular files. They are looking for matches between any two files inside very large data sets. Microsoft is indiscriminately scanning all documents, not just ones that they have a prior reason to suspect. That means you have to account for the birthday paradox. So you actually have a 50% chance of a hash match between two unrelated files when the library you're hashing reaches 2^64 files. Okay, that's still a really big number - but it's not implausibly far from recent estimates of files created per year.
In other words, it's unlikely but still possible that the file you think is child porn is actually the electronic invoice for my new refrigerator.
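For anyone who wants to check that birthday arithmetic, here is the standard approximation, assuming an ideal 128-bit hash (the function name is mine):

import math

def p_any_collision(n_files, bits):
    """Approximate chance that any two of n_files share a hash value."""
    n_pairs = n_files * (n_files - 1) / 2
    return 1 - math.exp(-n_pairs / 2**bits)

print(p_any_collision(2**64, 128))   # ~0.39
print(p_any_collision(2**65, 128))   # ~0.86; the 50% point is near 2**64.2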
1) Compare the hash of each suspect file on the PC with the hashes of the known files.
2a) If the hashes do not match, the suspect file is not a match with a known file.
2b) If the hashes match, then "Comparing the bits with the previously known file would have been sufficient." You do not have to display the image.
I'm not convinced that 1) is not a search given that you have to open the file and read all the bits to calculate the hash.
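A sketch of that two-step check in runnable form. The names are hypothetical, and known_files is assumed to map each known hash to the path of a reference copy of that file:

import filecmp
import hashlib

def file_sha256(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def matches_known(suspect_path, known_files):
    h = file_sha256(suspect_path)
    if h not in known_files:
        return False  # step 2a: no match
    # Step 2b: confirm bit-for-bit equality against the reference copy,
    # without ever rendering the image for human eyes.
    return filecmp.cmp(suspect_path, known_files[h], shallow=False)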
"2b) If the hashes match, then "Comparing the bits with the previously known file would have been sufficient." You do not have to display the image."
And nobody on the jury gets to actually see the picture, to confirm that the "previously known file" actually IS child porn? They just take the prosecutor's word for it?
My concern with the hash approach is that it would appear to be feasible to *tweak* any given file to have a matching hash. Though for all I know this might be very computationally intensive for some hash functions.
But, like DNA tests where the police are already in possession of the sequence of their desired defendant's DNA, you have to consider the possibility of deliberately spoofing the system in order to generate an excuse for a search.
I wish I lived in a country where the police were always too honest to do things like that, but I don't, I live in the US.
The jury could be shown the previously known file. (At this point there's a bit of a philosophical question as to what the difference is between showing them the previously known file and showing them the confiscated file, especially given that the file was copied over a network.)
Not sure what the concern is about tweaking a file. If you're worried about the police tampering with the file, that's something the defense can always claim whether you show the jury the file or not. If the possibility that police *might* have done something to tamper with the evidence constitutes reasonable doubt, then virtually no one can ever get convicted of anything.
All that said, when I posted above I thought they were using a secure hash. It turns out they almost certainly are not. One concern I would have about not comparing the bits, given that it's not a secure hash, is that someone could be given an innocuous file which was specially designed to have the same hash as child porn.
But if it were a secure hash, that'd be impossible, at least with currently known (and even currently speculated about) technology.
Drop the term "secure hash". It's a lie. There is only the question of how much computation is required to fake it, and whether that is impractical for a given level of technology and mathematical understanding.
In this case, first of all, kudos to government for bothering to manually check what a computer told them rather than just blindly accepting it and going to arrest people. That's going to become an ever-bigger issue with AI and cameras feeding an omnipresent panopticon realtime database of everyone. This probably should never exist but that's a different fight.
Second, I'm not so keen on a series of little steps, each of which seems reasonable, adding up to what seems like a search. We didn't take a drug sniffing dog into your house to search for drugs. He just came along. Atmosphere from inside happened to drift into his nose. Molecules quite reasonably impinged on receptors and sent that info to the brain. The dog reacted.
See? No search.
A hash (from the private entity) should be sufficient for a warrant. Running your own hash is a search of the file, much less opening it. Just get the warrant. It's not there to protect the guilty. It's there to protect The People from officials who want to hassle political opponents.
I don't see how running a hash is a search of the file.
If the file was on the guy's laptop, yes, running a hash on the file would be a search *of the laptop*. But in this case you're not even running a hash of the original file. You're running a hash of a copy (located on a police computer) of a copy (located on a Microsoft computer) of the original file.
Actually looking at the file would at least potentially be a search. After all, the image could be a scan of someone's medical records. But even that goes out the window if you already know, based on the SHA-256 of the file, that the file is not someone's medical records, and that it in fact is child porn.
And yes, I know the hash in question is not SHA-256. I used that as an example instead of using the term "secure hash," which I think has a technical definition, but since I'm too lazy to fight over that I'll just give in on that point.
In this case, the hash is a PhotoDNA hash. I agree with Ken825 below, that whether or not it is a search depends on how likely it is that some private legal image you have might get mistakenly flagged as child porn. If it's virtually impossible, I'd say it's not a search (within the meaning of the Fourth Amendment). If it's unlikely but not virtually impossible, I'd say it probably is a search (within the meaning of the Fourth Amendment), and the police should get a warrant.
One more thing. Even if running a hash of a child porn image *is* a search of a file, it's not a search of a file that is owned by the suspect, as it is illegal to own child porn.
This is literally the exact thing that a secure hash is resistant against.
Of course, you have to keep up with the state of the art, which means upgrading your cryptographic suite every 5-10 years. But somehow the internet, web browsers and operating systems manage, I'm sure our child-porn search purveyors can manage that minimal technical competence.
Elsewhere in the thread, Toranth is suggesting that the 'hash' in question isn't a secure hash. That makes sense to me; if it was a secure hash suspects could make trivial changes and avoid matching. Heck, many web sites automatically manipulate images as an optimization thing; a suspect wouldn't even need to do it himself - or he could just trivially save it with different EXIF data or whatever.
It makes sense that the matching algorithm is trying to make some kind of visual appearance match. I suppose that's what google images is trying to do when they suggest 'Visually similar images'. If you've ever looked at that, it's a mix of things that are actually similar, and things that aren't even close.
One hopes that whatever matching program is being used here has a much, much lower false positive rate than Google Images' 'visually similar' matcher, but we don't have those numbers. It's unlikely that the false positive rate is zero, though; a matcher strict enough for that would miss many cases of actual kiddie porn.
You are splitting hairs. Reviewing the bits electronically is for all practical purposes the same as opening the file.
2^64 is 18,446,744,073,709,551,616. If everyone in the world (7 billion) created 1 million files per year, we'd have to wait more than 2,635 years before we'd have that many files.
Moreover, the birthday paradox is not relevant, because Microsoft is not looking for *any* match between two files, but only a match against a limited number of files identified as child porn.
That said, I suspect the "hash" being used by PhotoDNA is much much worse than MD5.
I think you're radically underestimating the number of files that are created in a year. Remember that lots of files are system-generated. The only question is whether they get stored (and where).
And, yeah, what I've learned about PhotoDNA since the post above makes the math MUCH more likely for a collision.
You think there are more than 7,000,000,000,000,000 (seven quadrillion) files created per year? I don't know. I doubt it.
MD5 is a crap hash, in part because it's small enough that a birthday attack collision is within the realm of imagination (and in part because it's broken).
SHA256 is what I was thinking of when I made my comment at the top of this thread. At 256 bits, even a birthday attack (2^128) is infeasible, and as I said above, this is not a situation where the "birthday paradox" is relevant, because they're not looking for *any match*, but a match against the set of known images identified as child porn.
The birthday paradox is still relevant (though to a somewhat lower degree) because there are a very large number of known images of child porn. Yeah, it makes the situation less likely - but not, in my opinion, unlikely enough to omit the trivial requirement of actually opening the file to confirm its contents.
No, the Birthday Problem isn't relevant here, for several reasons.
First, the large number of comparisons that is the heart of the Birthday Problem comes from the fact that all elements are being compared to all other elements. That isn't happening here; we have a moderate set (all images on the internet) being compared to a small set (all child porn images). So instead of a quadratic increase in the likelihood, we have a mere linear one.
Second, your 'very large number of known images of child porn' is actually a very, VERY, small number. The number of known images is estimated in the millions range, although you may claim even billions. OneDrive, in this case, uses 1.8 million images. That's a mere 2^20 or 2^30 images. A hash space of 2^256 possible hashes is much, much larger. To make it clear, if you took a single random image and compared its hash to the known child porn hashes, you have roughly 1-in-2^230 chance of a matching hash.
That's a 0.00000000000000000000000000000000000000000000000000000000000000000001% chance of a match (67 zeros before the 1).
That's close enough to zero.
"Comparing the bits with the previously known file would have been sufficient."
If they had compared the full files, that would be true. A hash by definition cannot be unique to one given file.
Yes, comparing the full files is what I meant by "comparing the bits."
But how is this any different than 'opening and viewing' the file?
And if dogs can sniff human cancers, can a Microsoft OdorDNA air sampler be used by companies to reject job applicants that don't know they have cancer? Can Microsoft GaydarDNA be used to predict gay job applicants? Can Microsoft CriminalDNA alert police for a legal stop and frisk? As long as a machine/algorithm is doing the objective detection, does bias and discrimination evaporate?
Missing a few points:
Calculating a hash for a file requires the file to be "opened." Every bit of data within the file must be analyzed to obtain the hash. To obtain a hash, a cryptographic algorithm is executed on the contents of the entire file to obtain a large number. Opening the file with a viewing program merely allows a human being to see a visual representation of the data in a format that happens to be an image that a human brain can interpret.
In response to a prior comment, IF the cryptographic method used to create the hash is sufficiently robust, then yes, the hashes ARE unique for all practical purposes. In fact, if the cryptographic method is sufficiently robust, no visual inspection is needed to confirm anything: the file in question IS THE SAME as the one that initially triggered the suspicion and notification by the third party.
Finally, it seems this case keys on the validity or reliability of that third party's assertion that the suspected file is indeed contraband. Does the assertion of the third party satisfy probable cause? Remember, even calculating a hash value IS opening the file, which is critical to the analysis here. If the government in this case double checks that the file in question does generate the same hash as the known file, the file has already been opened to do that. If the government does not calculate that hash themselves, then they rely totally on the assertion of the third party that the file they provided from the suspect is contraband.
"Calculating a hash for a file requires the file to be "opened." Every bit of data within the file must be analyzed to obtain the hash. To obtain a hash, a cryptographic algorithm is executed on the contents of the entire file to obtain a large number. Opening the file with a viewing program merely allows a human being to see a visual representation of the data in a format that happens to be an image that a human brain can interpret."
All correct. But it does not answer the legal question of whether that is a search.
My point being that technically, there is no difference between opening a file to calculate a hash, and opening a file to "look at" the contents rendered in such a manner as to appease an eye-to-brain connection.
The author seems to spend a lot of time speaking of a hash as if it is not the same as opening the file. My point was to say that is not the case.
It seems to me that whether there is a difference between (a) opening a file to calculate a hash and (b) opening a file to look at the contents to appease an eye-to-brain connection depends on what legal standard the court is applying. If the court is applying a legal standard that depends on what the computer does, not what a person sees, then you are right. If the court is applying a legal standard that hinges on what a person sees, not what a computer does, then you are wrong. It depends on what standard the law is required to apply, not the standard that makes the most sense to you. It turns out that the Fourth Amendment search doctrine traditionally depends on what a person has observed -- the eye-to-brain connection, as you put it -- not what the computer has done. That's why it's a hard question, I think.
I'm a programmer, not a lawyer. Technically this is a hard question. "Opening" a file is kind of vague, and there's another confounding issue, maybe. The hash was compared against a set of known hashes of child pornography. A hash tells you two files are (very) likely the same file. The next thing to do would be a binary comparison, of the image data, which would tell you the images are exactly the same file.
These hashes of known pornography were generated from different copies of the file sourced from somewhere else. A detective with access to the database of original files the hashes were generated from could open the copy of the file from the database to verify the contents visually. There would never be a need to open the file provided by Microsoft. If he'd done that, would it be a search?
If the hashes Microsoft has were government provided, does that impact the agent question?
I don't know what is settled law about copies of files. Files don't have physical world analogues. The file the officer viewed wasn't the file he had on disk, it was a copy in memory. The file he had on disk was copied multiple times before it made it to the officer. If there was a search, was it even a search of the defendant's property, or was it the detective's property by then? To some extent it would be like viewing a digital copy of scanned prints of an x-ray of someone's trunk after the trunk got a hit from Microsoft's drug sniffing dogs. Is that a search?
So the government has the honor of having a computer look at it and give them an answer that's more likely to be right than a manual eyeballing, yet it is only the eyeballing that is intrusive enough to be a search? But even that isn't a search by this case.
So they get to take a half step that's not a search, process by computer, and another half step, manual eyeballing, that isn't a search, when just opening it sans pre-step would be a search? I don't like that reasoning.
Why is the standard different? Particularly as algorithms get more and more sophisticated (think AI), the distinction between a human looking at the contents and a human-written algorithm that reproduces the process the human uses all but disappears. Both actions mean the file is opened and its contents examined.
I don't get the scanned-by-eyeball vs. scanned-by-software distinction either.
If someone doesn't read my email with a Mark 1 eyeball, but instead just has a program scan it for the words 'cocaine', 'bomb', 'mary jane', 'underage', etc., that seems like a search to me.
I tried to look up exactly how 'PhotoDNA' performs their hashes, but it appears to be a proprietary method.
It appears to be an attempt to ID images regardless of image format, cropping, resolution changes, palette swapping, etc. In other words, it IDs similar images, not just identical ones. But I cannot find a word about how accurate it is, how secure it is... or even what the resulting 'hash' bitcount is.
Without that information being publicly available, I would say that it is absolutely necessary for a human to view the image. A secret, custom 'hash' algorithm just can't cut it - think of the Dilbert RNG.
Which means, the government needs to open it again using some public, reliable ID method. And that 'open again' will be acquiring new information. Namely, whether or not the presumed similar image (with an unknown measurement method and unknown error rate) is actually similar.
Ick. After reading that, I take back what I've written above. I was misled, and perhaps the court was as well, into thinking that the hash in question was unique to the file.
I too was bamboozled by the use of the word 'hash' to mean an industry-accepted cryptographic hash function. Instead we get this 🙁
@Orin, if it's not too much to ask, please do distinguish. I think the company is trading a bit on this ambiguity, imputing the mathematical properties of the former but instead selling the latter.
That's my thought too. Well put.
That really changes it. I'm not sure such a hash is at all accurate -- after all, hashes are supposed to vary greatly for small changes in the data being hashed.
Wish I'd read this before my other comment. Given this, viewing it definitely provides new information, and invalidates most of what I wrote.
That makes sense. Otherwise they could evade detection by randomizing the low-order bits in the image's color values, causing no noticeable difference in the image to a human eye but returning a totally different hash value.
On the other hand, it also means that if two files yield the same PhotoDNA hash, then they're similar visually. It's not going to confuse a ship with a plane (or for that matter, a ship with random pixel static) like MD5 can. Possibly legal pornography involving adults in similar poses would yield hashes colliding with illegal ones. It would be necessary for a human to view it to verify.
It also depends on what the "uniform size" mentioned above is. A smaller image size will be more resistant to minor obfuscating changes in the originals, but also cause more false collisions.
"On the other hand, it also means that if two files yield the same PhotoDNA hash, then they're similar visually."
That does not necessarily follow. In principle you could certainly generate files that hashed identically and looked completely different; they're throwing away a LOT of visually significant data here.
OTOH, it does suggest that you're unlikely to randomly get files that hash the same and don't look the same. Apparently this was intended as a defense against steganography.
You are absolutely right. In fact, there are large numbers of papers on how to generate photos that fool neural networks, which use a far more advanced method to classify images than these fingerprinting methods.
See, e.g., here.
"Converts images into a common black-and-white format and uniform size . . ."
One problem I encounter professionally, as a photographer, is called color metamerism.
Identical digital data can represent perceptibly different colors; not infrequently, colors which are not subtly different but dramatically so. Converting color data to black and white is one of the trickier problems.
After a black and white conversion, blues are rendered (like everything else) into shades of gray. Paler blues tend toward whites. Reds tend toward blacks. So your digital file data say this is a black square. Which was it originally, black or red? You don't know. Pale red will be light gray. So will somewhat darker blue. Identical digital data.
Visibly distinguishable colors (already an ambiguous concept) can and will render alike digitally, and thereafter be indistinguishable in subsequent files. Digital files are wildly ambiguous as color references. And every digital color rendering a file goes through changes the numbers, the colors displayed, or both, again and again, from camera sensor, to computer screen, to internet upload, to download, to printer, and so on.
Color science is a huge topic. This doesn't scratch the surface. But maybe it suggests caution with hash results compared from data sets where the birthday paradox looms ever larger as the reference data set expands.
Wow, that changes things significantly. It seems to me that the question of whether or not it is a search is dependent on the false positive rate. If the false positive rate is something like 1 out 2^1024, then I think it's not a search. If it is something like 1 out of 2^4, then it most clearly is a search.
I didn't find many PhotoDNA algorithm details online, but one article mentioned how it isn't affected by color space, image size, aspect ratio, or image rotation, and that it detects edges, which I'm guessing may be their way of isolating and identifying objects in images.
Convolutional Neural Networks (CNNs), used in machine learning and artificial intelligence, do all those things when processing and analyzing images: Standardizing image sizes, converting color spaces to grayscale, detecting edges and other low-level features using filters, etc. Though the usual output from a CNN isn't a hash or digital fingerprint, rather a classification, i.e., is the image an igloo or a hot dog? Calling the PhotoDNA output an image summary is probably accurate if overbroad.
I don't think the PhotoDNA algorithm has been published for public inspection, which makes it hard to trust. I'm not a computer security expert, but friends working in that area tell me that if you don't publish your algorithm, the security community won't take it seriously, because the best way to test it is to let anyone and everyone try to break it. Most likely Microsoft has shared enough details about PhotoDNA with law enforcement and NGOs to make them trust it, but that doesn't mean it can't be fooled into giving false positives.
From the description on Wikipedia (yes, I know), the PhotoDNA tool tries to determine whether two assumed-to-be-different photos are actually the same. I would call that a fingerprint. It's certainly *not* a hash as the term is commonly used in cryptography or computer science. Duplicating even a weak cryptographic hash such as MD5 on non-identical files is theoretically possible, but in reality it's just not going to happen by accident (search for "collision attack" or "collision resistance" if you're curious). More likely, PhotoDNA calculates a fingerprint for an image, hashes that fingerprint, and that makes searching for a matching fingerprint much faster.
Regardless, PhotoDNA is processing input images to see whether they resemble known reference images closely enough to be "the same" even though they are not identical. It's the same concept as checking for plagiarism or identifying songs. It would give both false positives and false negatives with nonzero probabilities, and a human would still need to look at the file to be sure.
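Since PhotoDNA itself is unpublished, perhaps the closest public illustration of the fingerprint idea is the well-known "average hash," sketched below. It is far cruder than whatever Microsoft actually does and is offered only to show the flavor of perceptual matching (requires the Pillow library):

from PIL import Image

def average_hash(path, size=8):
    """Downscale to grayscale, then record each pixel as above/below the mean."""
    img = Image.open(path).convert("L").resize((size, size))
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (p > mean)
    return bits  # a 64-bit fingerprint when size=8

def hamming(a, b):
    """Count differing bits; a small distance means 'visually similar'."""
    return bin(a ^ b).count("1")

Unlike a cryptographic hash, minor edits (recompression, slight color shifts, resizing) barely move this kind of fingerprint, which is exactly the property the matching task needs, at the price of possible false positives.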
Another way to look at the problem, from a color management point of view. Images I make with my camera typically get stored on my computer in a format called camera raw. That's a choice I make; there are other choices which many use.
During color correction, I may convert the color profile of the image (which might have been initially Adobe RGB, or so-called camera RGB) to a different profile, maybe sRGB. Many other choices are available for those values as well. There are even alternative color rendering systems which, in principle, and within gamut limits, can produce very similar-looking images from different data. Those include RGB, HSB, CMYK, and LAB.
The different profiles don't change the camera raw data. They do change the data in any exported, screen-displayed, or subsequently printed versions of the same image. And guess what: every computer which receives a copy of my file, however configured, takes what it receives and applies its own (almost certainly different) color management system to the file, to make it compatible with the user's equipment. Except for my own deliberate manipulations initially, this all goes on in the background, without anyone's conscious intervention.
Do all these different data sets, which purport to render the same image, deliver identical hash values? Always? Mostly? Sometimes? Never?
In the usual meaning of hashing, those should deliver completely different hash values. Note that PhotoDNA is not doing the usual thing, however, and I think it should not have been called hashing at all.
Hashes/message-digests/checksums are all intended to map a large set X to a much smaller set Y. When used for data structures, a hash function should give a much different y (imagine y to be just a 32-bit integer, for example) for even a tiny difference in the input x. Furthermore, the difference in y should be as random as possible. Look up "hash tables" for more info.
When used for checksums, the difference in y doesn't need to be vastly different, and randomness is not necessary. In other words, a small difference in x should yield a difference in y, but we don't care if the difference is large, or random.
When used for message digests, the difference in y should be random. Furthermore, it should be computationally infeasible to reverse. In other words, given a y, it should be computationally infeasible to compute an x that yields that y.
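The three uses side by side, in standard-library Python (note that the built-in hash() is randomized per process, which is fine for bucketing but useless as a stable identifier):

import hashlib
import zlib

data = b"some file contents"

# 1. Hash table: spread keys across buckets.
bucket = hash(data) % 1024

# 2. Checksum: CRC32 detects accidental corruption but is trivially forgeable.
checksum = zlib.crc32(data)

# 3. Message digest: SHA-256 is designed to be infeasible to reverse or collide.
digest = hashlib.sha256(data).hexdigest()

print(bucket, checksum, digest)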
1. Change the charge to felony stupid for uploading anything to Microsoft.
2. The file was sent by Microsoft as 'suspected' child porn based on a proprietary algorithm? Not sufficient (to me at least) to allow cops to open the file unless that algorithm has been proven in a court to be accurate enough to eliminate all reasonable doubt. Especially when the hash is not against the actual file, but against a series of manipulated files.
3. The term 'opened' applied to the file is not really significant. The original file was 'processed,' shall we say, electronically, then by a human. Whether the process is digital or analog should not matter. MS here is playing the role of the nosy neighbor calling the cops to say she saw something. Nothing is actually known until the image is viewed by a human, and a subjective definition of porn applied.
How is it that the National Center for Missing & Exploited Children (NCMEC) can have the largest library of child pornography and law enforcement does not arrest anyone there?
Why is this never mentioned in any of the criminal cases or appeals?
My local police station has an entire store room of stolen goods AND NO ONE HAS BEEN ARRESTED AT THAT STATION!
Why?
Just asking questions.
I heard they had DRUGS in there too!
It's not seized evidence. Nice try at dismissing the fact that a non-profit has the largest library of child porn in the world.
If you accept what I tell you about digital picture files (which is that the file data routinely gets changed by automated processes in the course of normal use of the files) then that raises at least 4 questions about hash-related searches.
1. How fine is the resolution of a hash-related search when the samples it is trying to match are not in fact digitally identical? Especially, given comparisons from very large data sets, how much room does that create for false positives?
2. How accurately inclusive can the matches be, based on the hash algorithm chosen? What range of digital variations can it digest without a hiccup, to reliably deliver matches which are correct in light of human inspection, and not others?
3. How much human back-stopping actually occurs when hash-based searches are used?
4. Given today's dynamic imaging software development market, centered on automated picture file changes (notably seen in Apple's iPhone X), how long would legal doctrines reliant on hash algorithms likely remain useful, or faithfully related to actual practice?
I wonder by what means these questions could be avoided, unless there were both a certain amount of plus-or-minus tolerance already built into use of the hash algorithms, and also notable human intervention.
I suggest legal experts trying to define whether hash-based search techniques pass muster, might do well to explore answers to those questions.
If you look above, I tried to find out how PhotoDNA does their matching. The answer is that it ISN'T a 'hash' by any meaning a programmer or computer scientist would recognize. It's better not to think of it as one at all, and call it something like an 'image summary code'.
While the questions about hash-based searches and probability are something that the law is going to need to deal with, in this case it doesn't seem relevant.
Thank you for that. It was a great comment. The apparent discrepancy between what you described, and what I experience while managing, distributing, printing, and color correcting photographs, led me to try to contribute.
I don't know much about hash technology. I'm also not a color scientist. I do know more than I ever wanted to about where practical difficulties and ambiguities in computerized color management come from, and the frustrations they create.
What I know suggests problems, difficulties, and unexpected results from operating the kind of "image summary" system you aptly describe.
But maybe not. Now I feel I need to know more.
Oh, I agree; your questions are interesting, and someone will need to answer them before these sorts of behaviors become too common.
I think what should happen is that summary matches with a reasonably low collision rate (and no secret, proprietary methods) could be used as 'dog signals' - a sufficient reason to get a search warrant, but not proof of anything in and of themselves. Summaries like PhotoDNA, however, simply cannot be accurate enough to be considered proof without some other, more precise method doing the final check.
On the other hand, a strong cryptographic hash match should be treated as a match in all other ways, even if no one else ever looks at the file. If you ever have a false positive, throw a party! No one living will ever see it happen again, so it should make world-wide news.
The wide difference between these two (absolutely untrustworthy and absolutely trustworthy) means to me that there isn't a lot of *technical* depth to discuss. Unfortunately, that doesn't translate into legal simplicity, though...
On the other hand, a strong cryptographic hash match should be treated as a match in all other ways, even if no one else ever looks at the file. If you ever have a false positive, throw a party! No one living will ever see it happen again, so it should make world-wide news.
That implies for me the following questions. To get that kind of match, the two image files must be pixel-by-pixel identical digitally. Right? Or not right?
If not right, what bounds the definition of a match?
If right, won't it be trivial to defeat that kind of surveillance?
This is where you get into probability.
A 256-bit hash is guaranteed to have multiple (different) files that produce any given hash value, as long as the space of possible files is larger than the space of possible hashes.
However, if you selected a given image and produced a strong cryptographic hash of it, then took the entire population of Earth and dedicated each person to producing a new image every second to see if it had a matching hash, then the Sun will die long before we have a 1% chance of finding a match.
It IS trivial to defeat a hash comparison, though, yes. Changing as little as one bit, anywhere in the image, would produce a different hash. There are billions of 'different' files whose images human eyes cannot tell apart.
That is why strong hashes aren't used in cases like this - instead they need to use other methods, which will necessarily be much more prone to error.
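Putting rough numbers on that claim, assuming an ideal 256-bit hash and a single fixed target digest (the population and solar-lifetime figures are round estimates):

population = 8e9                  # people, each producing one image per second
sun_remaining = 5e9 * 3.15e7      # ~5 billion years, in seconds
attempts = population * sun_remaining
print(f"{attempts:.1e} attempts, match chance ~{attempts / 2**256:.0e}")
# ~1e27 attempts against 2**256 outcomes: about 1e-50, nowhere near 1%.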
I suppose the key takeaway here is, never use cloud services for unencrypted data.
But I didn't need this case to tell me that.
Had a Microsoft employee opened the suspect images and confirmed that one or more were a visual depiction of sexually explicit conduct involving a minor, then reported it to the government, that would be a private search, and the private search reconstruction doctrine would allow the government to recreate but not exceed the search. But that wasn't the case here.
Prior to viewing the suspect images, the government agent knew a few things about them, notably that their PhotoDNA hash function values matched some PhotoDNA hash function values of known-bad NCMEC images. No concrete conclusions can be drawn from that, however. It's enough to raise suspicion, and if I were the government agent I'd think it worth my time, even my responsibility, to investigate, but there is a process for this.
Opening and viewing the images confirmed the suspicion that one or more of the images contained a visual depiction of sexually explicit conduct involving a minor, but that is new information.
Seems like a search to me.
Seems like there are two ways to approach the question:
1) Encrypt all files uploaded to a server. Better still, compress with password then encrypt. That should destroy any correlation between the uploaded file and a known "bad hombre."
2) When a third party's analytics show a possible correlation that they subsequently report to law enforcement (which, of course, they disclose in their terms of service) then LE can use a good faith reliance on that determination as probable cause to obtain a warrant to actually "place eyes" on the file in question.
Whether or not a 3rd party's search should be (as opposed to 'is') considered a violation of 4A privacy protections - assuming one actually believes there is a right to privacy - or a violation of one's rights to be secure in their effects from unreasonable searches is another question.
BTW, encryption and/or password protection should be sufficient to assert a reasonable expectation of privacy - again assuming that a right to privacy actually exists.
The right to privacy seems mostly to apply to bad cops hiding body-cam recordings and personnel records.
Can one embed a child pornography image into another image and extract it later? Would the hash then be detectable? That is, in effect, visual encryption. The crime would be decrypting and possessing the decrypted file rather than possessing the undecrypted file, no?
I'm reminded of the figure-ground drawing of the beautiful young lady or old hag. Can adding a layer of ambiguity to child pornography render it "legal"?
This isn't encryption, it's steganography.
I don't know, I think I'd view steganography as just a particular branch of encryption.
Presumably, if you hid child porn in an innocent looking photo by steganography, the photo fingerprinting software here would report back the innocent photo.
The general rule here is to just remember that "the cloud" just means "other people's computers", and don't upload anything to the cloud you'd object to everybody in the world seeing.
That's like saying horseback riding is a branch of canoeing. They have entirely different mathematical models :-/