简体   繁体   中英

Is it possible to extract text from PDF, whose “Page Extraction” is not allowed?

I am able to extract text from PDF's which doesn't have any security restrictions. I just want to know if it is possible to extract text from PDF which has restrictions

在此处输入图片说明

UPDATE:

Thanks to all for your comments. I appreciate your concern. Please understand the question. I did not ask how to do it. I just want to know if it is possible. I have created a PDF with these restrictions. I do not want my information to be extracted from my document. There are many developers who can achieve any task. I want to know if this task can be done. If this can be done, then I will investigate further to overcome this issue.

As the OP clarified that he asked the question to know whether his documents with such restrictions are safe from text extraction, and that he does not ask how to do it (in spite of the explicit languages and libraries given in tags), here an answer on the principle option, not a concrete implementation. Thus...

Yes, it is possible to extract text from documents with restrictions as long as the document can be read at all and no other means are applied to prevent text extraction.

The restrictions you show merely are flags that indicate to a PDF processor what the author wants to allow or not to allow a user to do with his document but they are not technical restrictions.

These restrictions can only be applied to encrypted documents, but you surely want these restrictions to work in particular for anyone (other than you) who can open the document for reading, be it by knowing a specific user password or be it by using the empty password.

Cf. the specification ISO 32000 (here from part 2, similarly in part 1 with a focus on PDF viewers):

If a user attempts to open an encrypted document that has a user password, the PDF reader shall first try to authenticate the encrypted document using the padding string defined in 7.6.4.3, "File encryption key algorithm" (default user password):

  • If this authentication attempt is successful, the PDF reader may open, decrypt, render and otherwise provide access to the document.

  • If this authentication attempt fails, the interactive PDF processor should prompt for a password. Correctly supplying either password (owner or user password) should enable the user to gain access to the document.

Whether additional operations shall be allowed on a decrypted document depends on which password (if any) was supplied when the document was opened and on any access restrictions that were specified when the document was created:

  • Opening the document with the correct owner password should allow full (owner) access to the document. This unlimited access includes the ability to change the document's passwords and access permissions.

  • Opening the document with the correct user password (or opening a document with the default password) should allow additional operations to be performed according to the user access permissions specified in the document's encryption dictionary.

Access permissions shall be specified in the form of flags corresponding to the various operations and the set of operations to which they correspond shall depend on the security handler's revision number (also stored in the encryption dictionary).

...

Once the document has been opened and decrypted successfully, a PDF reader technically has access to the entire contents of the document. There is nothing inherent in PDF encryption that enforces the document permissions specified in the encryption dictionary. PDF readers shall respect the intent of the document creator by restricting user access to an encrypted PDF file according to the permissions contained in the file.

(ISO 32000-2 section 7.6.4 Standard Security Handler)

Thus, these restrictions only work in cooperating PDF processors, but in particular in case of open source PDF libraries, it is trivial for a programmer to remove any code trying to enforce the restrictions.

Being aware of this, the developers of open source PDF libraries usually don't try to enforce the restrictions at all, or they add some flag to override restriction enforcement to prevent patched copies of the library to circulate.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM