Experience Writing Code to Read & Modify PDF Files?

The Volokh Conspiracy

Mostly law professors | Sometimes contrarian | Often libertarian | Always independent

Do any of you folks have any experience writing code to read and modify PDF files? Obviously, one would use a suitable preprogrammed package for this, and there are some such, but that's sometimes harder than it looks. If any of you can offer some tips (for a worthy cause), for instance which packages are best and how they can best be used, that would be great; please e-mail me at volokh at law.ucla.edu. Many thanks!





Peter Gerdes 4 years ago

I don't personally have experience, but I know there are Ghostscript libraries, pdfjam, MuPDF, PDFlib, QPDF, and others.

It might help to know what language you want to use. Or are you looking for someone to code something for you? (I recommend Upwork, or whatever it's called now, for that.)

AmosArch 4 years ago

You would normally do that with an SDK, which is a package of code that gives a programmer a set of tools for working with or on a particular application. The most obvious, but not the easiest, way would be to use Adobe's own library, but there are countless closed- and open-source alternatives, usually higher-level, for creating and editing PDFs.

https://opensource.adobe.com/dc-acrobat-sdk-docs/pdflsdk/pdfloverview.pdf

Jerry B. 4 years ago

This isn't the lead-in to a "learn to code" joke, is it?

1. RabbiHarveyWeinstein 4 years ago
  
  >Install Gentoo
  t. Richard Stallman
  
Aladdin's Carpet 4 years ago

There are a number of different tools to do this. I've done it using VBA, but Python can do it too. Generally, I've found Adobe's APIs to be very good in any language. Some of the comments above mentioned specific packages.

Dan Schiavetta 4 years ago

In 1990, while playing around with Westlaw at 2 a.m. in the law library, I discovered that you could add or subtract text to any court opinion you pulled up. Adding "not" in various places, for example, would dramatically change the stare decisis effect of Marbury v. Madison. And after it was printed out no one could tell that it had been changed.

This was before the era of email, before anything got saved as a pdf.

Those were the good old days.

Commenter_XY 4 years ago

Maybe moving to SumatraPDF (open source) and using Python would do what you are looking to do.

Darth Chocolate 4 years ago

I use Adobe Pro for about $15/month. Gives you a lot of tools besides editing PDFs, such as entering text in non-fillable PDFs, and doing redaction.

Archibald Tuttle 4 years ago

Tsk tsk. You must be sleepy this morning.

Writing your own code and asking about the best package are contradictory. Usually, your posts are much clearer than that.

As others said, PDF is deliberately hard because the authors don't want their stuff to be modified by third parties. If the author is able to describe the file as software, modification of the text would violate the DMCA or CFAA. The word processor is the platform, and the author's prose is the final step in coding that software.

1. Eugene Volokh 4 years ago
  
  I meant package in the sense of a library of code that performs basic operations, such as reading material from the PDF, writing to the PDF, etc.; the point of such packages is precisely so others can write their own code using the packages, right?
  
Soronel Haetir 4 years ago

I've never targeted PDF creation directly, but on Windows I have had great success using print-to-PDF, first with externally provided printer drivers and now with the one built into Windows 10.

pavon 4 years ago

In the past I've used iText PDF, which is a Java library. It was reasonably straightforward to use, as these things go; however, we were using it to generate reports from scratch. Modifying an existing file might be trickier, depending on what you are trying to do and what sort of mess the original PDF writer made. If you are simply trying to add content (a watermark, header content, etc.), that shouldn't be too difficult. At the other end of the spectrum, if you are trying to, say, change the font size and rewrap text, that will be extremely difficult.

Also, it is now dual-licensed under AGPL and a commercial license, and they don't publish the commercial pricing; you have to request a quote. If those licenses don't work for your use, it looks like OpenPDF is a fork of the software, split off from the last LGPL version.
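To illustrate the "simply adding content" case, here is a minimal sketch in Python rather than Java. It assumes the third-party pypdf package (not iText) and a separate one-page PDF that already contains the watermark or header. The function names and file paths are hypothetical, and this is one possible approach rather than a recommended workflow.

```python
def pages_to_stamp(page_count: int, first_page_only: bool = False) -> range:
    """Zero-based indexes of the pages that should receive the stamp."""
    return range(min(page_count, 1)) if first_page_only else range(page_count)


def stamp_pdf(source_path: str, stamp_path: str, output_path: str,
              first_page_only: bool = False) -> None:
    """Overlay page 1 of stamp_path onto the selected pages of source_path."""
    # Assumption: pypdf is installed (pip install pypdf). It is imported here
    # so that the helper above can be used without it.
    from pypdf import PdfReader, PdfWriter

    stamp_page = PdfReader(stamp_path).pages[0]
    reader = PdfReader(source_path)
    targets = pages_to_stamp(len(reader.pages), first_page_only)

    writer = PdfWriter()
    for index, page in enumerate(reader.pages):
        if index in targets:
            # merge_page draws the stamp's content on top of the existing page.
            page.merge_page(stamp_page)
        writer.add_page(page)

    with open(output_path, "wb") as handle:
        writer.write(handle)
```

A call such as stamp_pdf("brief.pdf", "draft-watermark.pdf", "brief-stamped.pdf") would write a new file and leave the original untouched. Nothing here reflows or rewraps existing text, which is the hard case described above.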

apedad 4 years ago

". . . (for a worthy cause). . . . "

Everything is a worthy cause to someone.

Future historians (on the winning side), will determine if it was indeed worthy.

Jimmy the Dane 4 years ago

Here is something I have learned as a systems administrator since PDFs became "a thing" in IT: just pay Adobe their blood money. It is far easier than any other "solution" or "workaround" out there. Adobe is a giant pain in the rear, but their products work and do what you need them to do. Sure, if you have time to try makeshift open-source solutions, you might be able to find one that works for you, but just using the Adobe product is going to save you a lot of that valuable time and frustration.

TangoDelta 4 years ago

The question could come down to how clean the PDF files are. If they were produced by a text editor, it would be pretty easy with several of the packages already mentioned above, but if they came off a scanner with a dirty platen and then went through OCR, you could be in for a long day of manual correction. Sometimes, for scanned docs without built-in OCR, it's best to export the PDF to image files and run photo filters to clean them up before running them through OCR. It's been several years since I last did it, however, so newer tools may have better built-in filters.

StephenM 4 years ago

If you want to make large changes, rather than the odd annotation or fill-in box, the best way for many PDFs is to convert them to a Word file, alter it, and reconvert to a PDF. There are many PDF-to-Word converters out there, and most have free trials. They don't do well on complicated PDFs, e.g., ones with lots of pictures and tables. This will not work for scanned PDFs, for which you will need OCR. Don't try to code your own PDF editor; therein madness lies.

1. Paladin_44 4 years ago
  
  My thought too, StephenM
  
Bartikus 4 years ago

If you want to go the Python route, I'd recommend pdfminer, pypdf, or pdfplumber to extract the text from the PDFs, then using regular expressions to pull out the text you'd like to focus on.
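As a rough sketch of that extract-then-regex approach, assuming pypdf is installed: the function names and the citation pattern below are made up for illustration (the pattern only matches U.S. Reports citations such as "5 U.S. 137"), so treat it as a starting point rather than a finished tool.

```python
import re
from typing import Iterable

# Hypothetical pattern for illustration: U.S. Reports citations like "5 U.S. 137".
CITATION_RE = re.compile(r"\b(\d{1,3})\s+U\.\s?S\.\s+(\d{1,4})\b")


def page_texts(pdf_path: str) -> list[str]:
    """Return the extracted text of each page of the PDF."""
    # Assumption: pypdf is installed (pip install pypdf). It is imported here
    # so that the regex helper below can be used without it.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # Scanned, image-only pages typically yield little or no text.
    return [page.extract_text() or "" for page in reader.pages]


def find_citations(texts: Iterable[str]) -> list[tuple[int, str]]:
    """Return (1-based page number, matched citation) pairs."""
    hits = []
    for page_number, text in enumerate(texts, start=1):
        for match in CITATION_RE.finditer(text):
            hits.append((page_number, match.group(0)))
    return hits
```

Typical use would be find_citations(page_texts("opinion.pdf")), where "opinion.pdf" is a placeholder path. Extracted text often has odd spacing and line breaks, so real patterns usually need to be more forgiving than this one.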

Rick Boatright 4 years ago

The answer is going to very strongly depend on the answers to two questions:

1) What is in the PDFs you're interested in?

A) Are the PDFs text-based, with all the text of the original document in the PDF, or

B) Are the PDFs you want to do things to largely image-type PDFs, where the file does not necessarily contain the text of the document at all, or perhaps contains only OCRed text derived from the primary source, a (probably scanned) image?

The programming techniques to deal with those two things are very different and few people have experience in both.

2) What are you wanting to do?

Indexing, searching stacks of PDFs for strings of text, or finding page numbers in a single PDF document is a very different thing from reflowing a document or changing font size or margins. Correlating files based on metadata that may be contained in the file, such as author or origination date, is again a different thing.

If you came to me as a pro, my answer would be "Poorly defined request, go back to the customer and ask for more information on what they want. We might bid on it, we might not."
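For the simpler end of those two questions, a first-pass triage might look like the sketch below. It assumes pypdf is installed, and the function names and the 20-character threshold are arbitrary choices for illustration. It guesses whether each page is text-based or image-only, finds the pages that contain a search string, and reads the author field from the metadata if one is present.

```python
from typing import Optional


def classify_pages(texts: list[str], min_chars: int = 20) -> list[str]:
    """Label each page 'text' or 'image-only' by how much text was extracted.

    The threshold is a crude heuristic, not a reliable test.
    """
    return ["text" if len(t.strip()) >= min_chars else "image-only" for t in texts]


def pages_containing(texts: list[str], needle: str) -> list[int]:
    """Return 1-based page numbers whose text contains needle (case-insensitive)."""
    needle = needle.casefold()
    return [n for n, t in enumerate(texts, start=1) if needle in t.casefold()]


def load_pdf(path: str) -> tuple[list[str], Optional[str]]:
    """Return (per-page text, author from the metadata or None)."""
    # Assumption: pypdf is installed (pip install pypdf). It is imported here
    # so that the two helpers above can be used without it.
    from pypdf import PdfReader

    reader = PdfReader(path)
    texts = [page.extract_text() or "" for page in reader.pages]
    info = reader.metadata  # may be None when the file carries no metadata
    author = info.author if info is not None else None
    return texts, author
```

Pages labeled "image-only" would need OCR before any text search could work, and a page that carries poor OCR text may still be labeled "text". None of this touches reflowing or resizing, which is the much harder problem.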

