
Reading PDF in Java as a file and making "PDF" editable

Asked 2012-09-06 21:14:25, tagged java / pdf

I have a program which will be used for building a question database. I'm making it for a site that wants users to know that the content was downloaded from that site. That's why I want the output to be PDF - almost everyone can view it, almost nobody can edit it (and remove e.g. a footer or watermark, unlike in some simpler file types). That explains why it HAS to be PDF.

This program will be used by numerous users who will create new databases or expand existing ones. That's why producing the output as multiple files is an extremely sloppy and inefficient way of achieving what I want (it would complicate things for the user).

What I want to do is create PDF files that are still editable with my program once created.

I want to achieve this by embedding my custom file type, readable by my program, in the output PDF.

I came up with three ways of doing that:

1. Attach the file to the PDF and then corrupt the part of the PDF that contains it, so that the PDF is unaware it contains the file, making it impossible for the user to notice it (easily). When reading the document, I'd revert the corruption and extract the file using one of the many PDF libraries.
2. Hide the file inside an image added to the PDF somewhere on the first or last page, somehow (I still need to work that out) hidden from the public eye. Knowing its location, it should be relatively easy to retrieve using a PDF library.
3. I have learned that if you add a "%" sign as the first character of a line inside a PDF, the whole line will be ignored (similar to "//" in Java) by the PDF reader (at least Adobe Reader), making it possible for me to add as many lines as I want to the PDF (if I know where, and I do) without the end user being aware of it. I could embed my whole custom file in the PDF that way. The problem here is that I actually have to read the PDF using one of Java's input readers, but I'm not sure which one. I understand that a PDF can't be read like a text file since it's a binary file (right?).

In the end, I decided to go with method number 3, unless someone has a better idea. The conditions are: 1. One file only, and that file is a PDF. 2. The user must not be aware of the addition.

The problem is that I don't know how to read the PDF as a file (I'm not trying to read it as a PDF, which I would do using a PDF library).

So, does anyone have a better idea?
If not, how do I read the PDF as a FILE, so that the output is an array of characters (with newline detection), and then rewrite the whole file with my added content?

2 answers

In Java, there is no real difference between text and binary files; you can read both as an InputStream. The difference is that for binary files you can't really create a Reader, because a Reader assumes there is a way to convert the byte stream to Unicode characters, and that won't work for PDF files.
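One nuance, offered as a small sketch rather than as part of the original answer: decoding with a multi-byte charset such as UTF-8 does damage binary content, but ISO-8859-1 maps every byte to exactly one char, so a round trip through it is lossless. The class and method names below are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1RoundTrip {
    // Decode the bytes as ISO-8859-1 and encode them back; each byte maps to exactly one char.
    static byte[] viaLatin1(byte[] raw) {
        String text = new String(raw, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    // The same trip through UTF-8 replaces invalid byte sequences with U+FFFD.
    static byte[] viaUtf8(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A comment line with high bytes, like the binary marker many PDFs carry near the top.
        byte[] sample = {'%', (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3, '\n'};
        System.out.println("ISO-8859-1 lossless: " + Arrays.equals(sample, viaLatin1(sample)));
        System.out.println("UTF-8 lossless: " + Arrays.equals(sample, viaUtf8(sample)));
    }
}
```

Working on the raw byte array is still the safer default; the ISO-8859-1 view is mainly handy when you want String methods such as lastIndexOf.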

So in your case, you'd need to read the file into byte buffers and loop over them to scan for the bytes representing '%' and the end-of-line characters in PDF.
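A minimal sketch of that scan, assuming the whole file fits in memory (for example after Files.readAllBytes). PdfCommentScanner and commentLines are hypothetical names. Note the limitation: the scan is naive, so a binary stream inside the PDF that happens to contain a line break followed by '%' would also be reported.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class PdfCommentScanner {
    // Returns every line whose first byte is '%', treating CR, LF and CRLF as line ends.
    static List<String> commentLines(byte[] pdf) {
        List<String> found = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= pdf.length; i++) {
            boolean atEnd = i == pdf.length;
            if (atEnd || pdf[i] == '\n' || pdf[i] == '\r') {
                if (i > start && pdf[start] == '%') {
                    // ISO-8859-1 keeps a one-to-one mapping between bytes and chars.
                    found.add(new String(pdf, start, i - start, StandardCharsets.ISO_8859_1));
                }
                // Skip the LF of a CRLF pair so it is not treated as a separate empty line.
                if (!atEnd && pdf[i] == '\r' && i + 1 < pdf.length && pdf[i + 1] == '\n') {
                    i++;
                }
                start = i + 1;
            }
        }
        return found;
    }
}
```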

A better way is to use another, existing mechanism for encoding data in a PDF: XMP tags. This allows any sort of complex key-value pairs to be encoded in XML and embedded in PDFs, JPEGs etc. See http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf .

There's an open source library in Java that allows you to manipulate that: http://pdfbox.apache.org/userguide/metadata.html . See also a related question from another guy who succeeded with it: custom schema to XMP metadata or http://plindenbaum.blogspot.co.uk/2010/07/pdfbox-insertextract-metadata-frominto.html
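To give a feel for what such metadata looks like, here is a sketch that builds an XMP packet carrying one custom property, using only the standard library. The namespace URI, the qdb prefix and the payload property are assumptions made up for this example. Attaching the packet to the PDF is left to a library such as PDFBox and is not shown.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class XmpPacketSketch {
    // Hypothetical namespace for the custom schema; replace it with a URI you control.
    static final String NS = "http://example.com/ns/qdb/1.0/";

    // Wraps the payload (Base64-encoded, so it needs no XML escaping) in a minimal XMP packet.
    static String buildPacket(byte[] payload) {
        String encoded = Base64.getEncoder().encodeToString(payload);
        return "<?xpacket begin=\"\uFEFF\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>\n"
             + "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">\n"
             + " <rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\n"
             + "  <rdf:Description rdf:about=\"\" xmlns:qdb=\"" + NS + "\">\n"
             + "   <qdb:payload>" + encoded + "</qdb:payload>\n"
             + "  </rdf:Description>\n"
             + " </rdf:RDF>\n"
             + "</x:xmpmeta>\n"
             + "<?xpacket end=\"w\"?>";
    }

    // Naive extraction by string search; a real reader would use an XML or XMP parser.
    static byte[] readPayload(String packet) {
        int from = packet.indexOf("<qdb:payload>") + "<qdb:payload>".length();
        int to = packet.indexOf("</qdb:payload>");
        return Base64.getDecoder().decode(packet.substring(from, to));
    }

    public static void main(String[] args) {
        String packet = buildPacket("question bank v1".getBytes(StandardCharsets.UTF_8));
        System.out.println(packet);
    }
}
```

Keep in mind that XMP metadata is a documented part of the file, so anyone who inspects the metadata can see it; it is not hidden in the way the question asks for, only out of the way.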

It's all just 1's and 0's - just use RandomAccessFile and start reading. The PDF specification defines what the valid newline characters are (there are several). Grab a hex editor and open a PDF and you can at least start getting a feel for things. Be careful where you insert your lines though - you'll need to add them towards the end of the file, where they won't screw up the xref table's offsets to the obj entries.
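As a rough illustration of "start reading" with RandomAccessFile, the sketch below looks only at the tail of the file and pulls out the number that follows the last startxref keyword. PdfTailReader is a hypothetical name, and the 1024-byte window is an assumption that covers typical files, not a limit taken from the specification.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

class PdfTailReader {
    // Reads up to the last 1024 bytes and returns the number after the final "startxref".
    static long readStartXref(String path) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            int tailLength = (int) Math.min(1024, file.length());
            byte[] tail = new byte[tailLength];
            file.seek(file.length() - tailLength);
            file.readFully(tail);
            String text = new String(tail, StandardCharsets.ISO_8859_1);
            int keyword = text.lastIndexOf("startxref");
            if (keyword < 0) {
                throw new IOException("startxref not found in the last " + tailLength + " bytes");
            }
            // The text between the keyword and %%EOF is the decimal offset of the xref section.
            int eof = text.indexOf("%%EOF", keyword);
            String number = text.substring(keyword + "startxref".length(),
                                           eof < 0 ? text.length() : eof).trim();
            return Long.parseLong(number);
        }
    }
}
```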

Here's a related question that may be of interest: PDF parsing file trailer

I would suggest putting your comment immediately before the startxref line. If you put it anywhere else, you could wind up shifting things around and breaking the xref table pointers.

So a simple algorithm for inserting your special comment will be:

1. Go to the end of the file
2. Search backwards for startxref
3. Insert your special comment immediately before startxref - be sure to insert a newline character at the end of your special comment
4. Save the PDF
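The steps above can be sketched in Java as follows, working on the file's bytes in memory (for example from Files.readAllBytes, written back with Files.write). The %QDB: marker and the class and method names are assumptions made up for this sketch. The payload is Base64-encoded so it stays on a single comment line and cannot contain a byte that would end the comment early.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class StartXrefComment {
    // Hypothetical marker that identifies our comment line; any unlikely prefix would do.
    static final String MARKER = "%QDB:";

    // Searches backwards from the end of the data for the last occurrence of needle.
    static int lastIndexOf(byte[] data, byte[] needle) {
        for (int i = data.length - needle.length; i >= 0; i--) {
            if (Arrays.equals(Arrays.copyOfRange(data, i, i + needle.length), needle)) {
                return i;
            }
        }
        return -1;
    }

    // Inserts "%QDB:<base64>\n" immediately before the last startxref keyword.
    static byte[] embed(byte[] pdf, byte[] payload) {
        int at = lastIndexOf(pdf, "startxref".getBytes(StandardCharsets.ISO_8859_1));
        if (at < 0) throw new IllegalArgumentException("no startxref keyword found");
        byte[] comment = (MARKER + Base64.getEncoder().encodeToString(payload) + "\n")
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] out = new byte[pdf.length + comment.length];
        System.arraycopy(pdf, 0, out, 0, at);            // bytes before startxref keep their offsets
        System.arraycopy(comment, 0, out, at, comment.length);
        System.arraycopy(pdf, at, out, at + comment.length, pdf.length - at);
        return out;
    }

    // Returns the decoded payload, or null when no marker line is present.
    static byte[] extract(byte[] pdf) {
        byte[] marker = MARKER.getBytes(StandardCharsets.ISO_8859_1);
        int at = lastIndexOf(pdf, marker);
        if (at < 0) return null;
        int from = at + marker.length;
        int end = from;
        while (end < pdf.length && pdf[end] != '\n' && pdf[end] != '\r') end++;
        String encoded = new String(pdf, from, end - from, StandardCharsets.ISO_8859_1);
        return Base64.getDecoder().decode(encoded);
    }
}
```

Because everything before the insertion point stays at the same byte offset, the number after startxref and the entries in the xref table remain valid.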

You can (and should) do this manually in a hex editor.

Really important: are your users going to be saving changes to these files? I.e. if they fill in a form field, are they going to hit save? If they are, your comment lines may be removed during the save (and different versions of different PDF viewers could behave differently in this regard).

XMP tags are the correct way to do what you are trying to do - you can embed entire XML segments, and I think you'd be hard pressed to come up with a data structure that couldn't be expressed as XML.

I personally recommend using iText for this, but I'm biased (I'm one of the devs). The iText In Action book has an excellent chapter on embedding XMP data into PDFs. Here's some sample code from the book (which I definitely recommend): http://itextpdf.com/examples/iia.php?id=217