The Volokh Conspiracy
Mostly law professors | Sometimes contrarian | Often libertarian | Always independent
Experience Writing Code to Read & Modify PDF Files?
Do any of you folks have any experience writing code to read and modify PDF files? Obviously, one would use a suitable preprogrammed package for this, and there are some such, but that's sometimes harder than it looks. If any of you can offer some tips (for a worthy cause), for instance which packages are best and how they can best be used, that would be great; please e-mail me at volokh at law.ucla.edu. Many thanks!
I don't personally have experience, but I know there are Ghostscript libraries, pdfjam, MuPDF, PDFlib, qpdf, and others.
It might help to know what language you want to use, or whether you are looking for someone to code something for you (I recommend Upwork, or whatever it's called now, for that).
You would normally do that with an SDK, which is a package of code that provides a set of tools for a programmer to work with or on a particular application. The most obvious but not the easiest way would be to do it with Adobe's own library, but there are countless closed and open source ways, usually higher level, to create and edit PDFs.
https://opensource.adobe.com/dc-acrobat-sdk-docs/pdflsdk/pdfloverview.pdf
This isn't the lead-in to a "learn to code" joke, is it?
>Install Gentoo
t. Richard Stallman
There are a number of different tools to do this. I've done it using VBA, but Python can also do it. Generally, I've found Adobe's APIs to be very good in any language. Some of the comments above mentioned specific packages.
In 1990, while playing around with Westlaw at 2 a.m. in the law library, I discovered that you could add or subtract text to any court opinion you pulled up. Adding "not" in various places, for example, would dramatically change the stare decisis effect of Marbury v. Madison. And after it was printed out no one could tell that it had been changed.
This was before the era of email, before anything got saved as a pdf.
Those were the good old days.
Maybe moving to Sumatra (open source) and using Python would do what you are looking to do.
I use Adobe Pro for about $15/month. Gives you a lot of tools besides editing PDFs, such as entering text in non-fillable PDFs, and doing redaction.
Tsk tsk. You must be sleepy this morning.
Writing your own code and asking about the best package are contradictory. Usually, your posts are much clearer than that.
As others said, PDF is deliberately hard because the authors don't want their stuff to be modified by third parties. If the author is able to describe the file as software, modification of the text would violate the DMCA or CFAA. The word processor is the platform and the author's prose is the final step to code that software.
I meant package in the sense of a library of code that performs basic operations, such as reading material from the PDF, writing to the PDF, etc.; the point of such packages is precisely so others can write their own code using the packages, right?
I've never targeted PDF creation directly, but on Windows I have had great success using print-to-PDF, first with externally provided printer drivers and now with the one available through Windows 10.
In the past I've used iText PDF, which is a Java library. It was reasonably straightforward to use, as these things go; however, we were using it to generate reports from scratch. Modifying an existing file might be trickier, depending on what you are trying to do and what sort of mess the original PDF writer made. If you are simply trying to add content (watermark, header content, etc.), that shouldn't be too difficult. On the other end of the spectrum, if you are trying to, say, change the font size and rewrap text, that will be extremely difficult.
Also, it is now dual-licensed under AGPL and a commercial license, and they don't publish the commercial pricing; you have to request a quote. If those licenses don't work for your use, it looks like OpenPDF is a fork of the software, split off from the last LGPL version.
". . . (for a worthy cause). . . . "
Everything is a worthy cause to someone.
Future historians (on the winning side) will determine whether it was indeed worthy.
Here is something I have learned as a systems administrator since PDFs became "a thing" in IT. Just pay Adobe their blood money. It is far easier than any other "solution" or "workaround" out there. Adobe is a giant pain in the rear, but their products work and do what you need them to do. Sure, if you have time to try makeshift open source solutions, you might be able to find one that works for you, but just using the Adobe product is going to save you a lot of that valuable time and frustration.
The question could come down to how clean the PDF files are. If they were produced by a text editor, it would be pretty easy with several of the packages already mentioned above, but if they came off a scanner with a dirty platen and through OCR, you could be in for a long day of manual correcting. Sometimes, on scanned docs without built-in OCR, it's best to export the PDF to picture files and run photo filters to clean them up before running them through OCR. It's been several years since I last did it, however, so newer tools may have better built-in filters.
If you want to make large changes, rather than the odd annotation or fill-in box, the best way for many PDFs is to convert them to a Word file, alter it, and reconvert it to a PDF. There are many PDF-to-Word converters out there, and most will have free trials. They don't do well on complicated PDFs, e.g. ones with lots of pictures and tables. This will not work for scanned PDFs, where you will need OCR. Don't try to code your own PDF editor; therein madness lies.
My thought too, StephenM
If you want to go the Python route, I'd recommend pdfminer, pypdf, or pdfplumber to extract the text from the PDFs, then using regex to pull out the text you'd like to focus on.
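A minimal sketch of that route, assuming pypdf is installed (pip install pypdf). The function names and the citation pattern are made up for illustration, and text extraction quality will vary a lot with how the PDF was produced:

```python
import re


def extract_text(path):
    """Return the text of every page in the PDF at `path`, joined by newlines."""
    # pypdf is a third-party package; it is imported here so the regex part
    # below can still be tried on plain strings where pypdf is not installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # extract_text() may give an empty string for pages with no text layer.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


# Hypothetical pattern for illustration: U.S. Reports citations like "5 U.S. 137".
US_CITATION = re.compile(r"\b\d+ U\.S\. \d+\b")


def find_citations(text):
    """Return every match of the citation pattern in `text`, in order."""
    return US_CITATION.findall(text)


if __name__ == "__main__":
    # Swap in extract_text("some_file.pdf") to run against a real document.
    sample = "See Marbury v. Madison, 5 U.S. 137 (1803)."
    print(find_citations(sample))
```

Keeping the extraction and the regex in separate functions means you can swap pypdf for pdfminer or pdfplumber without touching the pattern matching.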
The answer is going to very strongly depend on the answers to two questions:
1) What is in the PDFs you're interested in?
A) Are the PDFs text-based, i.e. is all the text of the original document in the PDF, or
B) Are the PDFs you want to work on largely image-type PDFs, where the file does not necessarily contain the text of the document at all, or contains only OCRed text derived from the primary source, a (probably scanned) image?
The programming techniques to deal with those two things are very different and few people have experience in both.
2) What are you wanting to do?
Indexing, searching stacks of PDFs for strings of text, or finding page numbers in a single PDF document is a very different thing from re-flowing a document or changing the font size or margins. Correlating files based on metadata that may be contained in the file, such as author or origination date, is a different thing again.
If you came to me as a pro, my answer would be "Poorly defined request, go back to the customer and ask for more information on what they want. We might bid on it, we might not."
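For question 1, a rough first pass is to check whether each page has any extractable text at all. This is only a sketch, assuming pypdf is installed; the function names and the 20-character threshold are arbitrary choices for illustration, and a scanned page that already carries an OCR layer will still show up as "text" here:

```python
def page_texts(path):
    """Return a list with the extracted text of each page in the PDF at `path`."""
    # pypdf is a third-party package (pip install pypdf); imported lazily so
    # classify_pages() below can be tried on plain strings without it.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]


def classify_pages(texts, min_chars=20):
    """Label each page "text" or "image-only" by how much text was extracted.

    min_chars is an assumed cutoff, not a standard value; tune it per corpus.
    """
    return [
        "text" if len(text.strip()) >= min_chars else "image-only"
        for text in texts
    ]


if __name__ == "__main__":
    # Swap in page_texts("some_file.pdf") to run against a real document.
    fake_pages = ["", "   ", "This page has a real text layer in it."]
    print(classify_pages(fake_pages))
```

Pages that come back as "image-only" are the ones that would need OCR before any searching or indexing can happen.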