
Reading a PDF in Java as a file and making the “PDF” editable

I have a program that will be used to build question databases. I'm making it for a site that wants users to know that the content was downloaded from that site. That's why I want the output to be PDF: almost everyone can view it, and almost nobody can edit it (to remove, e.g., a footer or watermark, unlike with some simpler file types). That explains why it HAS to be PDF.

This program will be used by numerous users who will create new databases or expand existing ones. That's why splitting the output across multiple files would be an extremely sloppy and inefficient way of achieving what I want (it would complicate things for the user).

And what I want to do is to create PDF files which are still editable with my program once created.

I want to achieve this by embedding my custom file type, readable by my program, in the output PDF.

I came up with three ways of doing that:

  1. Attach the file to the PDF and then corrupt the part of the PDF that contains it, so that the PDF is unaware it contains the file and the user cannot (easily) notice it. Upon reading the document I'd revert the corruption and extract the file using one of the many PDF libraries.

  2. Hide the file inside an image that would be added to the PDF somewhere on the first or last page, somehow (I still need to work that out) hidden from the public eye. Knowing its location, it should be relatively easy to retrieve it using a PDF library.

  3. I have learned that if you put a "%" sign as the first character of a line inside a PDF, the whole line will be ignored by the PDF reader (at least Adobe Reader), similar to "//" in Java. That makes it possible for me to add as many lines as I want to the PDF (if I know where, and I do) without the end user being aware of it. I could embed my whole custom file in the PDF that way. The problem here is that I actually have to read the PDF using one of Java's input readers, but I'm not sure which one. I understand that a PDF can't be read like a text file since it's a binary file (right?).

In the end, I decided to go with method number 3, unless someone has a better idea. The conditions are: 1. One file only, and that file is a PDF. 2. The user must not be aware of the addition.

The problem is that I don't know how to read the PDF as a file (I'm not trying to read it as a PDF, which I would do using a PDF library).

  1. So, does anyone have a better idea?
  2. If not, how do I read the PDF as a FILE, so that the output is an array of characters (with newline detection), and then rewrite the whole file with my content added?

In Java, there is no real difference between text and binary files; you can read both as an InputStream. The difference is that for binary files you can't really create a Reader, because a Reader assumes there's a way to convert the byte stream to Unicode characters, and that won't work for PDF files.

So in your case, you'd need to read the files in byte buffers and possibly loop over them to scan for bytes representing the '%' and end-of-line character in PDF.
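A minimal sketch of that byte-level scan, assuming the whole file fits in memory. The class name `PdfCommentScanner` is made up for illustration. Note that this is naive: compressed stream data inside a PDF can also contain bytes that happen to look like a line starting with '%', so treat the result as candidates rather than as guaranteed comments.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collects every line of a file whose first byte is '%'.
final class PdfCommentScanner {

    /** Scans raw bytes; a line ends at CR, LF, or the end of the data. */
    static List<String> commentLines(byte[] pdf) {
        List<String> out = new ArrayList<>();
        int lineStart = 0;
        for (int i = 0; i <= pdf.length; i++) {
            boolean atEnd = (i == pdf.length);
            if (atEnd || pdf[i] == '\r' || pdf[i] == '\n') {
                // Non-empty line whose first byte is '%'
                if (i > lineStart && pdf[lineStart] == '%') {
                    // ISO-8859-1 maps each byte to exactly one char, so nothing is lost
                    out.add(new String(pdf, lineStart, i - lineStart, StandardCharsets.ISO_8859_1));
                }
                lineStart = i + 1;
            }
        }
        return out;
    }

    /** Convenience overload that reads the file as bytes, not through a Reader. */
    static List<String> commentLines(Path file) throws IOException {
        return commentLines(Files.readAllBytes(file));
    }
}
```

Decoding with ISO-8859-1 is a deliberate choice here: it is a one-byte-per-character mapping, so a line can be turned into a String and back without corrupting binary content.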

A better way is to use an existing mechanism for encoding data in a PDF: XMP tags. This allows any sort of complex key-value pairs to be encoded in XML and embedded in PDFs, JPEGs, etc. See http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf .

There's an open source library in Java that allows you to manipulate that: http://pdfbox.apache.org/userguide/metadata.html . See also a related question from another guy who succeeded in it: custom schema to XMP metadata or http://plindenbaum.blogspot.co.uk/2010/07/pdfbox-insertextract-metadata-frominto.html

It's all just 1's and 0's - just use RandomAccessFile and start reading. The PDF specification defines what a valid newline character(s) is/are (there are several). Grab a hex editor and open a PDF and you can at least start getting a feel for things. Be careful of where you insert your lines though - you'll need to add them towards the end of the file where they won't screw up the xref table offsets to the obj entries.
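As a rough sketch of the RandomAccessFile approach, the helper below reads only the tail of the file and reports where the last `startxref` keyword sits. The class name `PdfTrailerProbe` and the `tailSize` parameter are assumptions for illustration, and the code assumes the keyword lies within the last `tailSize` bytes.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: locates the last "startxref" keyword near the end of a file.
final class PdfTrailerProbe {

    /**
     * Returns the absolute file offset of the last "startxref" keyword,
     * or -1 if it does not occur within the final tailSize bytes.
     */
    static long findStartXref(RandomAccessFile raf, int tailSize) throws IOException {
        long length = raf.length();
        int n = (int) Math.min(length, tailSize);
        byte[] tail = new byte[n];
        raf.seek(length - n);   // jump straight to the tail, no need to read the whole file
        raf.readFully(tail);
        // ISO-8859-1 keeps a 1:1 byte-to-char mapping, so string indexes equal byte offsets
        String text = new String(tail, StandardCharsets.ISO_8859_1);
        int idx = text.lastIndexOf("startxref");
        return idx < 0 ? -1 : (length - n) + idx;
    }
}
```

Opening the file with `new RandomAccessFile(file, "r")` and calling `PdfTrailerProbe.findStartXref(raf, 1024)` gives an offset you can compare against what a hex editor shows.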

Here's a related question that may be of interest: PDF parsing file trailer

I would suggest putting your comment immediately before the startxref line. If you put it anywhere else, you could wind up shifting things around and breaking the xref table pointers.

So a simple algorithm for inserting your special comment will be:

  1. Go to the end of the file.
  2. Search backwards for startxref.
  3. Insert your special comment immediately before startxref (be sure to insert a newline character at the end of your special comment).
  4. Save the PDF.

You can (and should) do this manually in a hex editor.
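The insertion algorithm could be sketched in Java roughly as follows. Everything named here is an assumption for illustration: the class `PdfCommentInserter`, the `%QDB:` marker, and the choice to Base64-encode the payload (so that it cannot contain newline bytes that would end the comment early). It works on a byte array and assumes `startxref` sits at the start of its own line, as it normally does.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: hides a payload in a '%' comment line just before "startxref".
final class PdfCommentInserter {
    private static final byte[] KEYWORD = "startxref".getBytes(StandardCharsets.ISO_8859_1);
    static final String MARKER = "%QDB:"; // made-up prefix used to find the comment again

    /** Searches backwards from the end of data for pattern; returns its index or -1. */
    static int lastIndexOf(byte[] data, byte[] pattern) {
        for (int i = data.length - pattern.length; i >= 0; i--) {
            int j = 0;
            while (j < pattern.length && data[i + j] == pattern[j]) j++;
            if (j == pattern.length) return i;
        }
        return -1;
    }

    /** Returns a copy of pdf with the comment line inserted before the last startxref. */
    static byte[] insertBeforeStartXref(byte[] pdf, byte[] payload) {
        int pos = lastIndexOf(pdf, KEYWORD);
        if (pos < 0) throw new IllegalArgumentException("no startxref keyword found");
        // The basic Base64 encoder emits no line breaks, so the comment stays on one line
        String line = MARKER + Base64.getEncoder().encodeToString(payload) + "\n";
        byte[] comment = line.getBytes(StandardCharsets.ISO_8859_1);
        ByteArrayOutputStream out = new ByteArrayOutputStream(pdf.length + comment.length);
        out.write(pdf, 0, pos);                 // everything up to startxref, byte for byte
        out.write(comment, 0, comment.length);  // the hidden comment line
        out.write(pdf, pos, pdf.length - pos);  // startxref, the offset, and %%EOF
        return out.toByteArray();
    }

    /** Reads the payload back, or returns null if no marker line is present. */
    static byte[] extractPayload(byte[] pdf) {
        byte[] marker = MARKER.getBytes(StandardCharsets.ISO_8859_1);
        int pos = lastIndexOf(pdf, marker);
        if (pos < 0) return null;
        int start = pos + marker.length;
        int end = start;
        while (end < pdf.length && pdf[end] != '\n' && pdf[end] != '\r') end++;
        String encoded = new String(pdf, start, end - start, StandardCharsets.ISO_8859_1);
        return Base64.getDecoder().decode(encoded);
    }
}
```

Because every byte before the insertion point is copied unchanged, the offset written after `startxref` still points at the xref table. Reading the file with `Files.readAllBytes` and writing the result with `Files.write` completes the round trip.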

Really important: are your users going to be saving changes to these files? I.e., if they fill in a form field, are they going to hit save? If they are, your comment lines may be removed during the save (and different versions of different PDF viewers could behave differently in this regard).

XMP tags are the correct way to do what you are trying to do - you can embed entire XML segments, and I think you'd be hard pressed to come up with a data structure that couldn't be expressed as XML.

I personally recommend using iText for this, but I'm biased (I'm one of the devs). The iText In Action book has an excellent chapter on embedding XMP data into PDFs. Here's some sample code from the book (which I definitely recommend): http://itextpdf.com/examples/iia.php?id=217
