
Extracting text from PDF

I am attempting to extract text from a PDF file using the code found here. The code uses the zlib library.

AFAICT the program works by finding blocks of memory between occurrences of the text "stream" and "endstream" in the pdf file. These chunks are then inflated by zlib.
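A rough sketch of that scanning step, assuming the whole file has already been read into a std::string. FindStreams is a hypothetical name chosen for illustration, not the function from the linked code, and a keyword search like this can be fooled by binary data that happens to contain "endstream" (the /Length entry of the stream dictionary is the reliable way to find the end).

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns [begin, end) byte offsets of the data between
// each "stream" / "endstream" keyword pair found in the file contents.
std::vector<std::pair<std::size_t, std::size_t>> FindStreams(const std::string& pdf)
{
    std::vector<std::pair<std::size_t, std::size_t>> spans;
    std::size_t pos = 0;
    while ((pos = pdf.find("stream", pos)) != std::string::npos)
    {
        std::size_t begin = pos + 6;           // skip the "stream" keyword
        // The keyword is followed by CRLF or a lone LF before the data starts.
        if (begin < pdf.size() && pdf[begin] == '\r') ++begin;
        if (begin < pdf.size() && pdf[begin] == '\n') ++begin;

        std::size_t end = pdf.find("endstream", begin);
        if (end == std::string::npos) break;   // unterminated stream: give up
        spans.emplace_back(begin, end);
        pos = end + 9;                         // continue after "endstream"
    }
    return spans;
}
```

Each returned span may still end with the line break that precedes "endstream"; the bytes inside are only inflatable if the stream really is Flate-encoded and not encrypted.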

The code works perfectly on one sample PDF document, but on another zlib's inflate() function returns -3 (Z_DATA_ERROR) every time it is called.

I noticed that the PDF file that fails is set up so that, when opened in Adobe Reader, there is no "copy" option. Could this be related to the inflate() error? And if it is, is there a way around the problem?

Code snippet below - see comments

            // Now use zlib to inflate:
            z_stream zstrm; ZeroMemory(&zstrm, sizeof(zstrm));

            zstrm.avail_in = streamend - streamstart + 1;
            zstrm.avail_out = outsize;
            zstrm.next_in = (Bytef*)(buffer + streamstart);
            zstrm.next_out = (Bytef*)output;

            int rsti = inflateInit(&zstrm);
            if (rsti == Z_OK)
            {
                int rst2 = inflate(&zstrm, Z_FINISH); // HERE IT RETURNS -3
                // Z_NEED_DICT is also positive, so test for success explicitly
                if (rst2 == Z_OK || rst2 == Z_STREAM_END)
                {
                    // Ok, got something, extract the text:
                    size_t totout = zstrm.total_out;
                    ProcessOutput(fileo, output, totout);
                }
                inflateEnd(&zstrm); // release zlib's internal state
            }

EDIT: I tested text extraction from the "encrypted" pdf via an online pdf-to-text converter called zamzar, and the resulting text file was perfect. So either zamzar has some super-duper decrypting system... or perhaps it's just not very difficult.

EDIT: Just found that A-pdf also converted to text without problems.

Streams in PDF need not be encoded with flate. They could be encoded with:

  1. Nothing
  2. LZW
  3. Flate
  4. ASCII85
  5. Crypt (which could be one of several different algorithms)

And (surprise, surprise) any of these methods could also be layered on top of each other!
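Which filters apply to a given stream is stated by the /Filter entry of the dictionary that precedes the "stream" keyword, either as a single name or as an array of names applied in order. Here is a minimal sketch of reading that entry, assuming the dictionary text has already been isolated as a string. ListFilters and ReadName are hypothetical helpers, and the sketch ignores cases such as a /Filter value given as an indirect reference.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Reads one PDF name token such as "/FlateDecode" starting at dict[i],
// returns it without the slash, and advances i past the token.
static std::string ReadName(const std::string& dict, std::size_t& i)
{
    const std::size_t start = ++i;  // skip the leading '/'
    while (i < dict.size() &&
           !std::isspace(static_cast<unsigned char>(dict[i])) &&
           dict[i] != '/' && dict[i] != '[' && dict[i] != ']' &&
           dict[i] != '<' && dict[i] != '>')
        ++i;
    return dict.substr(start, i - start);
}

// Hypothetical helper: lists the filter names of a stream dictionary in the
// order they must be undone. An empty result means no /Filter was found.
std::vector<std::string> ListFilters(const std::string& dict)
{
    std::vector<std::string> filters;
    std::size_t i = dict.find("/Filter");
    if (i == std::string::npos) return filters;
    i += 7;  // length of "/Filter"
    while (i < dict.size() && std::isspace(static_cast<unsigned char>(dict[i]))) ++i;
    if (i >= dict.size()) return filters;

    if (dict[i] == '/') {                      // single name
        filters.push_back(ReadName(dict, i));
    } else if (dict[i] == '[') {               // array of names
        ++i;
        while (i < dict.size() && dict[i] != ']') {
            if (dict[i] == '/') filters.push_back(ReadName(dict, i));
            else ++i;
        }
    }
    return filters;
}
```

Only a stream whose list is exactly FlateDecode can be handed straight to inflate(); anything else needs the other decoders applied first, and an encrypted document needs decryption before any of them.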

If there is no copy option, chances are it is encrypted with an owner password and no user password. This allows the author to create access permissions that are supposed to be honored by a reader including:

  1. Modifying the document contents
  2. Copying text/graphics
  3. Adding/editing annotations
  4. Printing
  5. Form filling
  6. Assembling the document (insert, delete pages, creating bookmarks, thumbnails)
  7. High/low quality print
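With the standard security handler, these permissions are stored as flag bits in the integer /P entry of the encryption dictionary. The sketch below decodes that value using the bit positions from the PDF reference (bits numbered from 1, low-order first). The struct and function names are made up for illustration.

```cpp
#include <cstdint>

// Hypothetical container for the flags held in the /P entry.
struct PdfPermissions {
    bool print;                    // bit 3
    bool modify;                   // bit 4
    bool copy;                     // bit 5: copy or extract text and graphics
    bool annotate;                 // bit 6
    bool fillForms;                // bit 9
    bool extractForAccessibility;  // bit 10
    bool assemble;                 // bit 11
    bool printHighQuality;         // bit 12
};

// Decodes the signed 32-bit /P value into individual flags.
PdfPermissions DecodePermissions(std::int32_t p)
{
    const std::uint32_t bits = static_cast<std::uint32_t>(p);
    // The specification numbers bits from 1, so bit n is (1 << (n - 1)).
    auto bit = [bits](int n) { return ((bits >> (n - 1)) & 1u) != 0; };

    PdfPermissions perms;
    perms.print = bit(3);
    perms.modify = bit(4);
    perms.copy = bit(5);
    perms.annotate = bit(6);
    perms.fillForms = bit(9);
    perms.extractForAccessibility = bit(10);
    perms.assemble = bit(11);
    perms.printHighQuality = bit(12);
    return perms;
}
```

For example, a /P value of -3904 clears every flag above, while -4 sets all of them. A cleared copy flag is what makes a conforming reader hide its "copy" option.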

This particular approach to getting text out of a PDF is fraught with error, and I can supply you with a set of documents that your approach won't be able to handle because of font re-encoding, split-up text, oddball locations, form XObjects, unusual transformations, and so on.

To do this properly, you need a better set of tools that aren't blind to the actual format and structure of a PDF document. iText will do this, and so will DotImage.

To give you an idea of the scope of the problem, I wrote the original text search code in Acrobat 1.0 and with all the internal tools available to me, it took me many months to get it right and the code included the ability to find text in unusual, non-rectilinear orientations (think maps), handling ligatures, re-encoding, non-roman fonts, and so on. While I was working on that code, there was another engineer who was dedicated full time for several years writing code called Wordy to do something similar (but more complicated) for full-text extraction and indexing (see this answer for more information about Wordy).

If there's no "copy" option, then the PDF is encrypted, and so is the stream. Plain zlib won't work; you'll have to decrypt the PDF first. While you are at it, use a proper library to extract the text: there's a lot of encoding to take care of, and not everything is WinAnsi.
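A quick way to tell up front whether a file falls into this category is to look for an /Encrypt entry, which an encrypted PDF carries in its trailer (or cross-reference stream) dictionary. The following is only a heuristic sketch with a hypothetical function name: it searches the raw bytes instead of parsing the trailer, so binary data that happens to contain the same characters would give a false positive.

```cpp
#include <string>

// Heuristic only: reports whether the raw file contents mention an /Encrypt
// entry anywhere. A real parser would read the trailer dictionary instead.
bool LooksEncrypted(const std::string& pdf)
{
    return pdf.find("/Encrypt") != std::string::npos;
}
```

If this returns true, skipping the inflate() attempt and reporting "encrypted" is more informative than a bare Z_DATA_ERROR.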

This can happen because the header differs from the documented one; for details, see the related question "ZLib Inflate() failing with -3 Z_DATA_ERROR".
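The header test that zlib applies can be reproduced by hand. Per RFC 1950, a zlib stream begins with two bytes, CMF and FLG: the low four bits of CMF must be 8 (deflate), the high four bits must not exceed 7, and the two bytes read as a big-endian 16-bit number must be a multiple of 31. A minimal sketch, with a hypothetical function name:

```cpp
#include <cstddef>

// Returns true if the first two bytes form a valid zlib (RFC 1950) header.
bool HasZlibHeader(const unsigned char* data, std::size_t size)
{
    if (size < 2) return false;
    const unsigned cmf = data[0];
    const unsigned flg = data[1];
    if ((cmf & 0x0Fu) != 8) return false;   // CM: compression method must be deflate
    if ((cmf >> 4) > 7) return false;       // CINFO: window size above 32K is invalid
    return ((cmf << 8) | flg) % 31 == 0;    // FCHECK: must be a multiple of 31
}
```

Typical Flate streams start with 0x78 followed by 0x01, 0x9C, or 0xDA. Encrypted stream data looks like random bytes and will almost always fail this test, which is consistent with inflate() returning Z_DATA_ERROR on every stream of an encrypted file.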


 