
Extracting text from PDF

I am attempting to extract text from a PDF file using the code found here. The code employs the zlib library.

AFAICT, the program works by finding the blocks of memory between occurrences of the text "stream" and "endstream" in the PDF file. These chunks are then inflated by zlib.
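For readers following along, the scan described above can be sketched roughly as follows. This is not the linked program's code: `FindStreamSpans` is a hypothetical name, and the sketch assumes the whole file is already loaded into a `std::string`. It deliberately ignores the `/Length` entry that a real parser would use.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns (offset, length) for the raw bytes of every
// stream body found by a naive keyword scan of the file contents.
std::vector<std::pair<std::size_t, std::size_t>> FindStreamSpans(const std::string& pdf)
{
    std::vector<std::pair<std::size_t, std::size_t>> spans;
    const std::string kBegin = "stream";
    const std::string kEnd = "endstream";

    std::size_t pos = 0;
    while ((pos = pdf.find(kBegin, pos)) != std::string::npos) {
        std::size_t start = pos + kBegin.size();
        // The keyword is followed by CRLF or LF; the data starts after it.
        if (start < pdf.size() && pdf[start] == '\r') ++start;
        if (start < pdf.size() && pdf[start] == '\n') ++start;

        const std::size_t end = pdf.find(kEnd, start);
        if (end == std::string::npos) break;

        // The span may still include the end-of-line marker that precedes
        // "endstream"; it is kept so that no real data byte is ever dropped.
        spans.push_back(std::make_pair(start, end - start));
        pos = end + kEnd.size();  // continue after "endstream"
    }
    return spans;
}
```

Each span would then be handed to `inflate()`. Note that the data must start exactly after the end-of-line marker that follows `stream`; a stray `\r` or `\n` at the front is enough to make zlib reject the input.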

The code works perfectly on one sample PDF document, but on another, zlib's inflate() function returns -3 (Z_DATA_ERROR) every time it is called.

I noticed that the PDF file that fails is set up so that, when it is opened in Adobe Reader, there is no "copy" option. Could this be related to the inflate() error? And if it is, is there a way around the problem?

Code snippet below (see comments):

            //Now use zlib to inflate:
            z_stream zstrm; ZeroMemory(&zstrm, sizeof(zstrm));

            zstrm.avail_in = streamend - streamstart + 1;
            zstrm.avail_out = outsize;
            zstrm.next_in = (Bytef*)(buffer + streamstart);
            zstrm.next_out = (Bytef*)output;

            int rsti = inflateInit(&zstrm);
            if (rsti == Z_OK)
            {
                int rst2 = inflate (&zstrm, Z_FINISH); // HERE IT RETURNS -3
                if (rst2 >= 0)
                {
                    //Ok, got something, extract the text:
                    size_t totout = zstrm.total_out;
                    ProcessOutput(fileo, output, totout);
                }
            }

EDIT: I tested text extraction from the "encrypted" PDF via an online PDF-to-text converter called zamzar, and the resulting text file was perfect. So either zamzar has some super-duper decrypting system, or perhaps it's just not very difficult.

EDIT: Just found that A-pdf also converted it to text without problems.

Streams in PDF need not be encoded with Flate. They could be encoded with:

  1. Nothing
  2. LZW
  3. Flate
  4. ASCII85
  5. Crypt (which could be one of several different algorithms)

And (surprise, surprise) any of these methods could also be layered on top of each other!
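To see which encodings a given stream actually uses, you can look at the `/Filter` entry of its dictionary, which holds either a single name or an array of names in the order they must be undone. The following is a minimal sketch, not a real PDF parser: `ListFilters` is a hypothetical helper, it assumes you already have the dictionary text in hand, and it does not follow indirect references.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// True for characters that end a PDF name token.
static bool IsNameEnd(char c)
{
    return std::isspace(static_cast<unsigned char>(c)) != 0 ||
           c == '/' || c == '[' || c == ']' || c == '<' || c == '>' ||
           c == '(' || c == ')';
}

// Hypothetical helper: lists the filter names of one stream dictionary.
// Assumes `dict` holds the text between "<<" and ">>".
std::vector<std::string> ListFilters(const std::string& dict)
{
    std::vector<std::string> filters;
    const std::string key = "/Filter";
    std::size_t pos = dict.find(key);
    if (pos == std::string::npos) return filters;  // no filter: raw data
    pos += key.size();

    while (pos < dict.size() && std::isspace(static_cast<unsigned char>(dict[pos]))) ++pos;
    const bool isArray = pos < dict.size() && dict[pos] == '[';
    if (isArray) ++pos;

    while (pos < dict.size()) {
        while (pos < dict.size() && std::isspace(static_cast<unsigned char>(dict[pos]))) ++pos;
        if (pos >= dict.size() || dict[pos] != '/') break;  // "]" or not a name
        const std::size_t nameStart = ++pos;
        while (pos < dict.size() && !IsNameEnd(dict[pos])) ++pos;
        filters.push_back(dict.substr(nameStart, pos - nameStart));
        if (!isArray) break;  // a single name, not an array
    }
    return filters;
}
```

A result such as `ASCII85Decode`, `FlateDecode` means the data has to be ASCII85-decoded first and only then inflated; calling `inflate()` directly on such a stream fails.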

If there is no copy option, chances are it is encrypted with an owner password and no user password. This allows the author to set access permissions that a reader is supposed to honor, including:

  1. Modifying the document contents
  2. Copying text/graphics
  3. Adding/editing annotations
  4. Printing
  5. Form filling
  6. Assembling the document (insert, delete pages, creating bookmarks, thumbnails)
  7. High/low quality print
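These permissions are stored as bit flags in the `/P` entry of the encryption dictionary, a signed 32-bit integer. The sketch below decodes that value using the bit positions of the standard security handler (bits are numbered from 1, lowest first). The struct and function names are made up for illustration, and reading `/P` out of the file is left out.

```cpp
#include <cstdint>

// Illustrative container for the flags in the list above.
struct PdfPermissions {
    bool print;             // bit 3
    bool modify;            // bit 4
    bool copy;              // bit 5: copy or extract text and graphics
    bool annotate;          // bit 6
    bool fillForms;         // bit 9
    bool assemble;          // bit 11
    bool printHighQuality;  // bit 12
};

// Hypothetical helper: decodes the /P value of an encryption dictionary.
PdfPermissions DecodePermissions(std::int32_t p)
{
    const std::uint32_t bits = static_cast<std::uint32_t>(p);
    auto allowed = [bits](int bitNumber) {
        return ((bits >> (bitNumber - 1)) & 1u) != 0;
    };

    PdfPermissions perms;
    perms.print = allowed(3);
    perms.modify = allowed(4);
    perms.copy = allowed(5);
    perms.annotate = allowed(6);
    perms.fillForms = allowed(9);
    perms.assemble = allowed(11);
    perms.printHighQuality = allowed(12);
    return perms;
}
```

For example, a `/P` value of -3904 has all of these bits cleared, which is consistent with a file that opens without a password but offers no "copy" option.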

This particular approach to getting text out of a PDF is fraught with error, and I can supply you with a set of documents that you won't be able to handle with your approach because of font re-encoding, split-up text, oddball locations, form XObjects, unusual transformations, and so on.

To do this properly, you need a better set of tools that aren't blind to the actual format and structure of a PDF document. iText will do this; DotImage will do this.

To give you an idea of the scope of the problem, I wrote the original text search code in Acrobat 1.0, and even with all the internal tools available to me, it took me many months to get it right. That code included the ability to find text in unusual, non-rectilinear orientations (think maps) and handled ligatures, re-encoding, non-roman fonts, and so on. While I was working on that code, another engineer was dedicated full time for several years to writing code called Wordy to do something similar (but more complicated) for full-text extraction and indexing (see this answer for more information about Wordy).

If there's no "copy" option then the PDF is encrypted, and so are the streams. Plain zlib won't work; you'll have to decrypt the PDF first. And while you are at it, use a proper library to extract the text: there are a lot of encodings to take care of, and not everything is WinAnsi.
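A cheap way to tell up front whether decryption is needed is to look for an `/Encrypt` entry, which an encrypted PDF carries in its trailer. The check below is only a rough sketch: `LooksEncrypted` is a hypothetical name, it assumes the whole file is in memory, and a plain text search can give false positives if the same bytes happen to appear inside page content.

```cpp
#include <string>

// Hypothetical helper: reports whether the file appears to carry an
// encryption dictionary. A real parser would read the trailer properly.
bool LooksEncrypted(const std::string& pdf)
{
    return pdf.find("/Encrypt") != std::string::npos;
}
```

If this returns true, skip the `inflate()` calls: the stream bytes are ciphertext, so zlib will report Z_DATA_ERROR no matter how they are sliced.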

This can happen because the header differs from what is documented; for that, see the related question "ZLib Inflate() failing with -3 Z_DATA_ERROR".
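One quick diagnostic is to check whether the first two bytes handed to `inflate()` even look like a zlib header. As defined in RFC 1950, the low nibble of the first byte must be 8 (deflate), the high nibble at most 7, and the two bytes read as a big-endian 16-bit number must be a multiple of 31. The function name below is made up for illustration.

```cpp
#include <cstddef>

// Hypothetical helper: true if `data` starts with a plausible zlib header.
bool HasZlibHeader(const unsigned char* data, std::size_t size)
{
    if (size < 2) return false;
    const unsigned cmf = data[0];
    const unsigned flg = data[1];

    const bool isDeflate = (cmf & 0x0Fu) == 8u;               // CM field
    const bool windowOk = (cmf >> 4) <= 7u;                   // CINFO field
    const bool checkOk = (((cmf << 8) | flg) % 31u) == 0u;    // FCHECK
    return isDeflate && windowOk && checkOk;
}
```

Typical zlib output starts with 0x78 followed by 0x01, 0x9C, or 0xDA. If the check fails on every stream of a file, the data is probably encrypted, encoded with a different filter, or the start offset is off by a byte or two.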

