
Is it possible to extract text from a PDF whose “Page Extraction” is not allowed?

Original post: 2018-06-24 14:03:25 6 1 java/ python/ itext/ pdftotext/ pypdf2

I am able to extract text from PDFs that don't have any security restrictions. I just want to know whether it is possible to extract text from a PDF that has restrictions.

UPDATE:

Thanks to all for your comments. I appreciate your concern. Please understand the question. I did not ask how to do it. I just want to know if it is possible. I have created a PDF with these restrictions. I do not want my information to be extracted from my document. There are many developers who can achieve any task. I want to know if this task can be done. If this can be done, then I will investigate further to overcome this issue.

1 Answer

As the OP clarified that he asked the question to find out whether his documents with such restrictions are safe from text extraction, and that he is not asking how to do it (in spite of the explicit languages and libraries given in the tags), here is an answer on what is possible in principle, not a concrete implementation. Thus...

Yes, it is possible to extract text from documents with restrictions, as long as the document can be read at all and no other means are applied to prevent text extraction.

The restrictions you show are merely flags that tell a PDF processor what the author wants to allow or disallow a user to do with the document; they are not technical restrictions.
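To make "merely flags" concrete: in the standard security handler the permissions live in a single integer, the `P` entry of the encryption dictionary, and each operation corresponds to one bit. The following is only a rough sketch of how such a value could be decoded; the function name and the short labels are my own, and the bit meanings are paraphrased from the user access permissions table in ISO 32000, so check them against the specification before relying on them.

```python
# Bit positions are 1-based, with bit 1 the low-order bit, as in ISO 32000.
# The labels are informal paraphrases, not official names.
PERMISSION_BITS = {
    3: "print",
    4: "modify contents",
    5: "copy or extract text and graphics",
    6: "add or modify annotations",
    9: "fill in existing form fields",
    10: "extract for accessibility",
    11: "assemble document (insert, rotate, delete pages)",
    12: "print in high quality",
}


def decode_permissions(p: int) -> dict[str, bool]:
    """Map a /P value (a signed 32-bit integer) to named permission flags.

    decode_permissions is a hypothetical helper, not part of any PDF library.
    """
    unsigned = p & 0xFFFFFFFF  # reinterpret the signed value as 32 raw bits
    return {
        name: bool(unsigned & (1 << (bit - 1)))
        for bit, name in PERMISSION_BITS.items()
    }


if __name__ == "__main__":
    # -3904 is a value with all of the permission bits above cleared.
    for name, allowed in decode_permissions(-3904).items():
        print(f"{name}: {'allowed' if allowed else 'not allowed'}")
```

Nothing in this value is cryptographic; it is a request to the processor, stored next to the encryption parameters.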

These restrictions can only be applied to encrypted documents, but you surely want them to hold in particular for anyone (other than you) who can open the document for reading, be it by knowing a specific user password or by using the empty password.
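As a side note on what "the empty password" means technically: for the older revisions of the standard security handler (the RC4 and AES-128 variants, as far as I know), every password is truncated or padded to 32 bytes with a fixed, publicly documented padding string before the key is derived, so an empty password simply becomes that padding string. A minimal sketch, assuming those revisions and ignoring the question of how the password text is encoded to bytes; `pad_password` is a name I made up for illustration:

```python
# The 32-byte padding string from the standard security handler's
# file encryption key algorithm (ISO 32000, section 7.6.4.3).
PADDING = bytes([
    0x28, 0xBF, 0x4E, 0x5E, 0x4E, 0x75, 0x8A, 0x41,
    0x64, 0x00, 0x4E, 0x56, 0xFF, 0xFA, 0x01, 0x08,
    0x2E, 0x2E, 0x00, 0xB6, 0xD0, 0x68, 0x3E, 0x80,
    0x2F, 0x0C, 0xA9, 0xFE, 0x64, 0x53, 0x69, 0x7A,
])


def pad_password(password: bytes) -> bytes:
    """Truncate or pad a password to exactly 32 bytes.

    Hypothetical helper covering only the first step of key derivation;
    it does not compute an encryption key.
    """
    return (password[:32] + PADDING)[:32]


if __name__ == "__main__":
    # With no user password set, the padded password is the padding string.
    print(pad_password(b"") == PADDING)
```

Since the padding string is public, a document without a user password can be decrypted by any processor without asking anyone for anything.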

Cf. the specification ISO 32000 (here from part 2; similarly in part 1, with a focus on PDF viewers):

If a user attempts to open an encrypted document that has a user password, the PDF reader shall first try to authenticate the encrypted document using the padding string defined in 7.6.4.3, "File encryption key algorithm" (default user password):

If this authentication attempt is successful, the PDF reader may open, decrypt, render and otherwise provide access to the document.

If this authentication attempt fails, the interactive PDF processor should prompt for a password. Correctly supplying either password (owner or user password) should enable the user to gain access to the document.

Whether additional operations shall be allowed on a decrypted document depends on which password (if any) was supplied when the document was opened and on any access restrictions that were specified when the document was created:

Opening the document with the correct owner password should allow full (owner) access to the document. This unlimited access includes the ability to change the document's passwords and access permissions.

Opening the document with the correct user password (or opening a document with the default password) should allow additional operations to be performed according to the user access permissions specified in the document's encryption dictionary.

Access permissions shall be specified in the form of flags corresponding to the various operations and the set of operations to which they correspond shall depend on the security handler's revision number (also stored in the encryption dictionary).

...

Once the document has been opened and decrypted successfully, a PDF reader technically has access to the entire contents of the document. There is nothing inherent in PDF encryption that enforces the document permissions specified in the encryption dictionary. PDF readers shall respect the intent of the document creator by restricting user access to an encrypted PDF file according to the permissions contained in the file.

(ISO 32000-2 section 7.6.4 Standard Security Handler)

Thus, these restrictions only work in cooperating PDF processors. In the case of open source PDF libraries in particular, it is trivial for a programmer to remove any code that tries to enforce the restrictions.

Being aware of this, the developers of open source PDF libraries usually either don't try to enforce the restrictions at all or add some flag to override restriction enforcement, to prevent patched copies of the library from circulating.
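To illustrate what "cooperating" amounts to, here is a toy model of the situation, not code from any real library. `ToyPdfProcessor`, its `ignore_restrictions` switch and the `COPY_BIT` constant are all invented for this sketch; the only assumption taken from the specification is that the copy/extract permission is bit 5 of the permissions value.

```python
COPY_BIT = 1 << 4  # bit 5 (1-based): copy or extract text and graphics


class ToyPdfProcessor:
    """Hypothetical stand-in for a PDF library; not a real API."""

    def __init__(self, permissions: int, ignore_restrictions: bool = False):
        self.permissions = permissions & 0xFFFFFFFF
        self.ignore_restrictions = ignore_restrictions

    def extract_text(self, decrypted_text: str) -> str:
        # By the time this method runs the content is already decrypted.
        # The check below is a courtesy to the author, nothing more.
        if not self.ignore_restrictions and not (self.permissions & COPY_BIT):
            raise PermissionError("extraction not permitted by document flags")
        return decrypted_text


if __name__ == "__main__":
    restricted = -3904  # a permissions value with the copy bit cleared
    try:
        ToyPdfProcessor(restricted).extract_text("my information")
    except PermissionError as err:
        print("cooperating processor:", err)
    print(ToyPdfProcessor(restricted, ignore_restrictions=True)
          .extract_text("my information"))
```

The entire "protection" is the one `if` statement; deleting it, or flipping the switch, leaves the text fully available. So if the information must not be extractable, the flags alone will not achieve that.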