The Volokh Conspiracy
Mostly law professors | Sometimes contrarian | Often libertarian | Always independent
Avoid Super-Embarrassing Redaction Failures
A Public Service Announcement, especially for the lawyers among our readers.
[I first posted a version of this post in 2020, but I've seen the problem enough since to think it was worth mentioning again.]
I have often run across documents written by lawyers that looked redacted—but all the supposedly secret information in them could be extracted with literally three keystrokes (ctrl-A, ctrl-C, ctrl-V). One was a court filing that was filed pursuant to a court order authorizing the redaction; but the material so carefully marked secret proved not to be secret at all.
Another carefully tried to hide the real name of a litigant whom the lawyer was trying to keep pseudonymous; but the name was one copy-and-paste away from being visible. What's more, when the documents were posted online in searchable spaces, search engines indexed the supposedly hidden material, so searching for the real name would find the document in which the lawyer had been trying to redact the name.
For at least one of the documents, I know what improper redaction mechanism was used: The lawyer used Google Docs to highlight passages using black highlighter, and then saved the document as a PDF. That looked blacked out on the screen; but the underlying text still remained in the PDF document—as far as the software was concerned, the text wasn't removed but was just set in a different color. (Something similar would happen with Microsoft Word.)
By pressing ctrl-A in the PDF, I selected the whole document. (You can also just select the passage that contains the redactions.) By pressing ctrl-C, I copied the selected text to the clipboard. And then by pressing ctrl-V in another app, I pasted it with all the formatting, including the highlighting, removed. (In some situations, it takes a ctrl-shift-V.) The text was then completely visible. Commenter anorlunda on an earlier post explained the problem well:
Users are trained WYSIWYG. What you see is what you get. That's brilliant marketing, but when you make black text on a black background, what you see is nothing, but what you get is something else. So redaction contradicts our training.
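The copy-and-paste behavior described above can be modeled in a few lines. This is only a toy sketch: the `Run` class and the three helper functions are hypothetical names invented for illustration, not the actual internals of Google Docs, Word, or PDF, and the litigant name is made up. It shows why changing how text looks is different from removing it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    """A stretch of text plus its formatting (hypothetical toy model)."""
    text: str
    highlight: Optional[str] = None  # e.g. "black"; None means no highlight


def paste_as_plain_text(runs):
    """Mimic ctrl-A, ctrl-C, ctrl-V into a plain-text app: formatting is dropped."""
    return "".join(run.text for run in runs)


def highlight_black(runs, secret):
    """Fake redaction: change how the secret looks, but keep its text."""
    return [Run(r.text, "black") if r.text == secret else r for r in runs]


def redact(runs, secret, placeholder="[REDACTED]"):
    """Real redaction: replace the secret's text itself."""
    return [Run(placeholder) if r.text == secret else r for r in runs]


doc = [Run("Plaintiff "), Run("Jane Roe"), Run(" alleges that ...")]

print(paste_as_plain_text(highlight_black(doc, "Jane Roe")))  # name still there
print(paste_as_plain_text(redact(doc, "Jane Roe")))           # name gone
```

The highlighted version pastes with the name intact because only the `highlight` attribute changed; the redacted version has nothing left to recover.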
To the best of my knowledge, Adobe Acrobat Pro redaction actually deletes the underlying text, if you mark the text for redaction and then apply the redactions. I'm sure there is other software available to do this, including free software. Just make sure that whatever you do, the redaction is actually complete.
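One way to check that a redaction is complete is to extract the text from the finished file (by copy and paste, or with whatever tool you trust) and search it for the terms that were supposed to disappear. Here is a minimal sketch of that check; `find_leaks` and `normalize` are hypothetical helper names, and the sample text is made up.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so a line break can't hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_leaks(extracted_text, secrets):
    """Return the secrets that still appear in text extracted from the redacted file."""
    haystack = normalize(extracted_text)
    return [s for s in secrets if normalize(s) in haystack]


# Text pasted out of a supposedly redacted document (made-up example)
pasted = "Plaintiff Jane\nRoe alleges that ..."
print(find_leaks(pasted, ["Jane Roe", "123-45-6789"]))  # ['Jane Roe']
```

An empty result only shows that the extracted text is clean. It says nothing about images or about anything the extraction step did not capture.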
Of course, the most reliable redaction mechanism (because it tends to be less likely to involve user error in the use of even excellent redaction software) is still printing, blacking out the material completely, and then scanning it back into a new file. [UPDATE: Two commenters caution that even this might not work, because the highlighter may be a different shade of black than the text; I'm unaware of any instances where the text was recovered from a photocopy of a page where the text was fully blocked out by a black-looking highlighter, but I can't vouch for that never happening.] But this option won't work for court filings in the many courts that require full-text-searchable PDFs generated directly from the computer, rather than from a scanner.
I don't think the premise of the last paragraph is correct. If the black that one uses to redact the printed out version is not exactly the same shade of black as the printed material, it is possible to recover the redacted material, even if the naked eye can't tell. Using an actual redact software tool — or, I suppose, scissors — is the only guaranteed way to make sure it cannot be recovered.
Yeah, it sounds like using the virtual black highlighter, then printing and re-scanning would be a better way to go.
Why don’t you just use the tool that is designed to do this!
I was fully on board up to here. But using a physical black highlighter can create a similar problem to doing it in a word processor: what looks like a solid black box to the naked eye may still leave the information recoverable through enhancing contrast in image editing software, especially with the resolutions most modern scanners are capable of. On the other hand, I’m not aware of any vulnerabilities with mainstream software-based redaction, as long as you actually use the real function.
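To see why a slightly different shade of black matters, here is a toy sketch of a contrast stretch on a handful of gray values. The pixel values are invented for illustration (real scans are noisier and real image editors do more than this), and `stretch_contrast` is a hypothetical helper, not a function from any imaging library.

```python
def stretch_contrast(pixels):
    """Linearly rescale gray values so the darkest becomes 0 and the lightest 255."""
    lo = min(min(row) for row in pixels)
    hi = max(max(row) for row in pixels)
    if hi == lo:
        # Every pixel is identical: there is no difference left to amplify.
        return [[0 for _ in row] for row in pixels]
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in pixels]


INK, MARKER = 5, 20  # made-up gray levels; both look "black" to the eye

# Crop of a scan taken entirely inside the blacked-out area
scan = [
    [MARKER, INK, MARKER, INK],
    [MARKER, INK, INK, MARKER],
]
print(stretch_contrast(scan))  # [[255, 0, 255, 0], [255, 0, 0, 255]]

# The same region after every pixel has been overwritten with one value
flattened = [[0 for _ in row] for row in scan]
print(stretch_contrast(flattened))  # [[0, 0, 0, 0], [0, 0, 0, 0]]
```

In the first case the printed ink separates cleanly from the marker once the small difference is amplified. In the second there is nothing to recover, which is the property a real redaction tool is meant to give you.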
Are there courts that actually require that? If so, I know some older practitioners who are in for a rough time…
With respect to Prof. Volokh’s update, I can verify that the increase contrast option is at least sometimes possible in documents hand redacted and scanned, because I’ve done it (albeit not on court filings). If you redact, photocopy, and then scan you’re probably okay: photocopiers tend to be relatively low resolution, and obviously you can’t really recover data that isn’t there. But I’m still not sure why you’d go that route, when:
1. There’s no real way to be certain that you got everything;
2. The software-based redaction is much easier and as secure or more secure; and
3. The non-redacted parts of your PDF will look much better and be much easier for your readers to use if you use a software generated document instead of a scanned one.
I don't believe there are. There are ones that require that scanned documents be OCRed so that they are text searchable, but not that require that the documents be computer generated in the first place.
Fair enough—frankly, I’m not sure I remember the last time I used a scanner that didn’t OCR PDFs automatically. I think some modern efiling platforms do too, although it looks like CM/ECF (including NextGen) doesn’t. While looking it up, though, I found that the district of DC actually has instructions not to OCR your submissions!
Presumably, if your OCR software is decent, I'm not sure how anybody would be the wiser.
Regarding the highlighting ... IANAL but I have seen printed documents which are readable if you hold them slanted just right under a strong light, and sometimes if you do the same on the back side. It might be good enough if you then photocopy the page, or take a picture of it, making sure the light isn't from the side at just that right slant. But I would never trust that.
For a similar article, see https://www.americanbar.org/groups/judicial/publications/judges_journal/2019/spring/embarrassing-redaction-failures/.
As noted by other commenters (and now by E.V. in the mainstream), printing a physical paper copy and Magic Markering it to black out is not a viable solution. And yes, some (now many?) courts do require fulltext searchable PDF (and all will do so soon).
Low-tech solution: (i) Use your text editor (preferably LibreOffice Writer) to create a PDF; (ii) then use a graphics editor (preferably GIMP) to black out the naughty bits, exporting it to an unsearchable PDF; (iii) then use an OCR program (preferably ocrmypdf, on Linux) to make a new searchable PDF. There are online services to do step (iii), but that exposes the naughty bits to the online service.
Hi-tech solution: Use LibreOffice Writer/Draw, which has a redaction feature to create redacted searchable PDF, see https://www.techrepublic.com/videos/how-to-use-document-redaction-in-libreoffice-6-3-0-4/.
Oops, wrote too fast. Step (iii) doesn't expose the naughty bits, but it does expose the non-naughty bits, which you may (and probably do) not want exposed outside of courts.
Better references (than the YouTube video):
https://help.libreoffice.org/latest/en-US/text/shared/guide/redaction.html?DbPAR=WRITER
https://help.libreoffice.org/latest/en-US/text/shared/guide/auto_redact.html?DbPAR=WRITER
Some other issues:
1. Metadata. I think there was a post here about the client's true name being in the history of an apparently pseudonymous document.
2. Formatting. Everybody can see how big the secret word is and make some guesses about what it is. The technical problem is not hard. Use variable substitution so $plaintiff turns into John Doe in the public version and Donald Trump in the private version. (I don't know how this works in fashionable software. I could do it in LaTeX if I could remember how many backslashes to use.) But then your document has different formatting in public and private versions.
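For what it's worth, the variable-substitution idea can be sketched with Python's standard `string.Template`, which uses the same `$plaintiff` syntax. The sentence and all the names below are made up for illustration; this is not a claim about how any word processor or LaTeX setup does it.

```python
from string import Template

# One source text, with placeholders instead of names
BRIEF = Template("$plaintiff alleges that $defendant breached the contract.")

# Made-up values: the sealed version uses the real name, the public one a pseudonym
PRIVATE = {"plaintiff": "Jane Roe", "defendant": "Acme Corp."}
PUBLIC = {**PRIVATE, "plaintiff": "John Doe"}

sealed = BRIEF.substitute(PRIVATE)
public = BRIEF.substitute(PUBLIC)

print(public)  # John Doe alleges that Acme Corp. breached the contract.
```

Because the real name never enters the public output, there is nothing underneath to copy and paste. `substitute` also raises `KeyError` if a placeholder has no value, so a missing name fails loudly rather than silently. The caveat from the comment still applies: the two versions will not line up character for character.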
If it is only one or a few letter combinations, e.g. "John Doe", simply do a replace "John Doe" with "------------" (without quotes in both cases). You can have any number of "-" you wish and then depending on court rules, you can either leave it as ---------- or then black it out. Remember to use "save as" and not "save", and enter something in the file name (e.g. "redacted") so you have a copy with the name in it and another without it.
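A rough sketch of that find-and-replace, assuming you are working on the plain text of the document. `mask_terms` is a hypothetical helper; it uses one fixed-width mask for every term, so the length of the mask says nothing about the length of the name, and it reports how many replacements it made so you can compare against what you expected.

```python
import re

MASK = "-" * 12  # same width for every term, regardless of the term's length


def mask_terms(text, terms, mask=MASK):
    """Replace every occurrence of each term (ignoring case) with the mask.

    Returns the masked text and the total number of replacements made.
    """
    total = 0
    for term in terms:
        pattern = re.compile(re.escape(term), re.IGNORECASE)
        # A function replacement avoids any backslash handling in the mask.
        text, n = pattern.subn(lambda _match: mask, text)
        total += n
    return text, total


masked, count = mask_terms("Jane Roe sued. JANE ROE later settled.", ["Jane Roe"])
print(masked)  # ------------ sued. ------------ later settled.
print(count)   # 2
```

This only catches exact spellings of the listed terms. Nicknames, initials, and misspellings would need to be added to the list by hand.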
Not sure about the metadata though.
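On the metadata point: a .docx file is a ZIP archive of XML parts, and the document properties (author, last-modified-by) live in `docProps/core.xml`. Here is a minimal sketch that searches every XML part of the archive for a term. `parts_mentioning` is a hypothetical helper, and the demo builds a fake two-part archive in memory rather than a real Word file.

```python
import io
import zipfile


def parts_mentioning(docx, terms):
    """Return (part name, term) pairs for every XML part whose raw text contains a term.

    `docx` may be a file path or a binary file-like object.
    """
    hits = []
    with zipfile.ZipFile(docx) as archive:
        for name in archive.namelist():
            if not name.endswith(".xml"):
                continue
            xml = archive.read(name).decode("utf-8", errors="replace").lower()
            hits.extend((name, term) for term in terms if term.lower() in xml)
    return hits


# Fake archive for illustration: clean body, real name left in the properties
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("word/document.xml", "<w:t>John Doe alleges ...</w:t>")
    z.writestr("docProps/core.xml", "<dc:creator>Jane Roe</dc:creator>")
buf.seek(0)

hits = parts_mentioning(buf, ["Jane Roe"])
print(hits)  # [('docProps/core.xml', 'Jane Roe')]
```

A raw substring search like this can miss a name that Word has split across several formatting runs, so a clean result is a hint, not proof. It also says nothing about metadata in a PDF exported from the file.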
John Carr's metrics bug is well-taken, and can be defeated by using the low/hi solution I mentioned, adding extra spaces to defeat the bug. That makes the original Word/Writer doc look funny (bad spacing), but nobody's supposed to see that anyway (except the doc creators, and court if/when it requires an unredacted copy).
Without "widow/orphan" protection, you could mess up your page numbering, i.e. what is on what page.
Actually, this could happen WITH it as well.
Not so (if I understand what you're saying). The original/Writer doc "looks" the same as the PDF, except one has goofy-looking spaces where the other has black boxes. The unredacted stuff is in the same page position in both.
I see there's now even a cottage industry for redaction (incorporating AI, of course). https://www.redactable.com.
My objection to such methods is that they require you to upload the sensitive document to a third party.
One useless redaction method I've encountered with editable pdfs - someone just puts a black oblong over the text to be redacted and saves. The next person can simply remove the oblong when editing the pdf.
You can load the scanned document into Photoshop page by page. Then paint over the redacted portions with full black at 100% opacity and save the result as either an image file or a new PDF page. Then in Acrobat (or equivalent) read in the page files and combine them into one document.
For a two- or three-page letter this is a relatively painless way to be sure that the redaction cannot be reversed.
This morning I redacted and sanitized a PDF document with Acrobat Pro. I just checked the resulting document and could find no hidden text. Just be sure with Acrobat to not only redact, but also Sanitize to remove hidden data.
BTW I was idly wondering whether there were circumstances where one might intentionally redact badly so that the opposition could discover what you had redacted but were hence misled.
While not strictly a redaction matter, I still vividly recall an incident during the Lehman Brothers bankruptcy when someone at Cleary Gottlieb Steen & Hamilton used the Excel "hide" function, rather than the "delete row" function, to edit a list of assets being transferred -- and as the list of assets was being converted to the court-required format, the hidden rows came back into view and a set of "toxic" assets which Cleary Gottlieb's client quite definitely did not intend to acquire was nevertheless transferred to it. (I say "someone" because while a very junior associate was blamed, quiet gossip was that a much more senior attorney had made the underlying error, and no one had been willing to risk annoying him by correcting it.) I remember this vividly because shortly before that debacle came to light, I was assigned to work on a case where Cleary Gottlieb was opposing counsel and had filed a motion lambasting our competence in handling . . . Excel spreadsheets.
I do statistical consulting for my bread money and Excel “hide” is just the worst. I tell people I’m not on their projects and can’t see sensitive info about subjects. I load their Excel file up in R and… Oh look! I have names, addresses, phone numbers for 400 people that I didn’t f*ing want.
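Hidden rows are easy to detect before a file goes out, because an .xlsx file is a ZIP archive and each worksheet part marks a hidden row with `hidden="1"` on its `row` element. Below is a minimal sketch using only the standard library. `hidden_rows` is a hypothetical helper, and the demo workbook is a stripped-down fake with a single worksheet part, not something Excel itself would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace, in ElementTree's {uri}tag form
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def hidden_rows(xlsx):
    """Map each worksheet part name to the row numbers flagged as hidden."""
    found = {}
    with zipfile.ZipFile(xlsx) as archive:
        for name in archive.namelist():
            if not (name.startswith("xl/worksheets/") and name.endswith(".xml")):
                continue
            root = ET.fromstring(archive.read(name))
            rows = [row.get("r") for row in root.iter(NS + "row")
                    if row.get("hidden") in ("1", "true")]
            if rows:
                found[name] = rows
    return found


# Fake one-sheet archive for illustration: row 2 is hidden, not deleted
SHEET = (
    '<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
    "<sheetData>"
    '<row r="1"><c r="A1" t="inlineStr"><is><t>Asset</t></is></c></row>'
    '<row r="2" hidden="1"><c r="A2" t="inlineStr"><is><t>Toxic asset</t></is></c></row>'
    "</sheetData></worksheet>"
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("xl/worksheets/sheet1.xml", SHEET)
buf.seek(0)

result = hidden_rows(buf)
print(result)  # {'xl/worksheets/sheet1.xml': ['2']}
```

This sketch looks at rows only. Hidden columns and hidden sheets are stored elsewhere in the file and would need their own checks.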
IANAL, but does there have to be a black box? Is it possible for a lawyer to just go to whatever program made the form (LaTeX??) and replace the name with dashes or an autogenerated black box? It feels weird that the original has to be edited after the fact rather than just recompiled with redacted info r———d (for example).
Side Note: as a statistician the final conclusion is the only true way to keep data safe is to randomize it at the point of collection. Best case scenario is data is handed to evil guy and nothing happens because not a single number indicates anything firm.
Example of a Fail: You want to get an idea of how many students cheat on an exam. So 10 students are asked if they cheated on an exam, answers recorded, the proportions of yeah’s reported. If someone knew 9/10 of the students they could work back to student 10 cheating or not. Even more bluntly, someone could steal our notebook with the responses in it.
Example of a Win: Instead of a student saying yes/no to “did you cheat?” you have them flip a (slightly weighted) coin. Heads + student cheated is a 1 but a tail + student didn’t cheat is also a 1. Likewise head+didn’t cheat = 0 and tail+cheat=0. I can give you a full list of students and their responses and you can’t do a damn thing with the data for any individual. Nor is there any key/cryptography that can get to an individual’s cheating habits. Our analysis gets weaker but the safety is expansive.
It’s kind of beautiful in its own way
*coin has to be weighted by a known proportion which influences how much safety vs info we want.
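As I read the scheme above, a student reports 1 when the coin and the truth agree (heads and cheated, or tails and didn't cheat) and 0 otherwise. If the coin lands heads with known probability p and the true cheating rate is π, then P(report = 1) = p·π + (1 − p)(1 − π), which can be solved for π as long as p is not exactly 0.5. A small simulation sketch follows; the 15% cheating rate, the sample size, and the function names are all assumptions made up for illustration.

```python
import random


def randomized_report(cheated, p_heads, rng):
    """Return 1 if the weighted coin and the truth agree, else 0."""
    heads = rng.random() < p_heads
    return int(heads == cheated)


def estimate_rate(reports, p_heads):
    """Recover the population cheating rate from the noisy reports."""
    if p_heads == 0.5:
        raise ValueError("a fair coin carries no information; use p_heads != 0.5")
    observed = sum(reports) / len(reports)
    # Solve observed = p*pi + (1 - p)*(1 - pi) for pi
    return (observed - (1 - p_heads)) / (2 * p_heads - 1)


rng = random.Random(0)                    # fixed seed so the run is repeatable
truth = [i < 3000 for i in range(20000)]  # made-up population: 15% cheated
reports = [randomized_report(c, 0.7, rng) for c in truth]

print(round(estimate_rate(reports, 0.7), 3))  # should land near 0.15
```

Moving p closer to 0.5 makes each individual report less revealing but makes the estimate noisier, which is the safety versus information trade-off mentioned in the footnote.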
PDF Expert from Readdle has a Redact tool that completely and permanently removes the selected text from the saved PDF. There is no way to recover the text. See: https://support.readdle.com/pdfexpert/en_US/edit-pdfs/remove-sensitive-content
While the help file mentions something about a subscription only feature, I have a full perpetual license and the feature works fine without any subscription.
When I used to work in digital records management, we used software that was primarily used for scanning, or could import PDFs, which converted the PDF pages as image files. When we had to apply redactions, it was done as a modification to the image itself, so when the document was compiled as a PDF there was literally no text there for the OCR engine to recognize. A bit old school compared to some of the tools out there today.
One problem with modern technology is that a lot of people don't understand the layers that can go into digital images or documents - while you may see one thing on the surface, from a digital perspective there are more than the two dimensions you see on the surface of a sheet of paper.
Another fav is when redactions are partial words due to lack of precision on part of redactor, pronouns are left unredacted when the options of possible people are limited, and redactions are incomplete. In short, lack of attention to detail can make the whole exercise pointless.
I know I once saw a redacted document with *one* instance of the victim's name left in clear text. It was from multiple years ago, so I figured the best thing I could do was to leave it alone rather than bring it to anyone's attention and maybe Streisand Effect it.