
Cannot read generated text of a PDF file in Java

I am trying to read the text of a PDF file in Java, and it isn't working well. Here is my code:

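// Imports assume PDFBox 2.0.x; package names differ in other major versions.
import java.io.File;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

File pdfFile = new File("1.pdf");
// Read-only access ("r") is enough for text extraction.
PDFParser parser = new PDFParser(new RandomAccessFile(pdfFile, "r"));
parser.parse();
COSDocument cosDoc = parser.getDocument();
PDDocument pdDoc = new PDDocument(cosDoc);
PDFTextStripper pdfStripper = new PDFTextStripper();
// Extract the text of pages 1 through 5 (1-based, inclusive).
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(5);
String parsedText = pdfStripper.getText(pdDoc);
System.out.println(parsedText);
// Release the underlying file handle.
pdDoc.close();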

But the result looks like this:

Please wait...

If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document.

You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by visiting http://www.adobe.com/go/reader_download .

For more assistance with Adobe Reader visit http://www.adobe.com/go/acrreader .

Windows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark of Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the US and other countries.

I found out that this happens because the file is an XFA PDF document. But I don't know anything about the XFA format of my PDF document, so please let me know how I can find out more about it.

Someone help me please. Thank you!

To sum up what has been said or hinted at in the comments...

The text quoted by the OP,

Please wait...

If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document.

...

is the content of the single PDF page Adobe software commonly puts into PDFs with a pure XFA form.
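
One rough way to notice this situation in code is to check whether the extracted text contains that placeholder message. The sketch below is only a heuristic built on the wording quoted above; the class and method names are made up for illustration, and a placeholder page produced by other software may be worded differently.

```java
import java.util.Locale;

public class XfaPlaceholderCheck {
    // Phrases taken from the placeholder page quoted above (heuristic only).
    private static final String[] MARKERS = {
        "please wait...",
        "your pdf viewer may not be able to display this type of document"
    };

    /** Returns true if the extracted text contains every marker phrase. */
    public static boolean looksLikeXfaPlaceholder(String extractedText) {
        if (extractedText == null) {
            return false;
        }
        // Collapse whitespace so line breaks added during extraction do not matter.
        String normalized = extractedText.replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
        for (String marker : MARKERS) {
            if (!normalized.contains(marker)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String sample = "Please wait...\nIf this message is not eventually replaced by the proper "
                + "contents of the document, your PDF viewer may not be\nable to display this type of document.";
        System.out.println(looksLikeXfaPlaceholder(sample));
        System.out.println(looksLikeXfaPlaceholder("Invoice 2024-01"));
    }
}
```

A match only suggests that the real content lives in an XFA form; it does not prove it.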

XFA forms constitute an alternative way to describe forms in PDFs. In contrast to the AcroForm way, XFA forms only use the PDF as an envelope carrying an XML stream that describes the properties, behavior, and values of the form in a way unrelated to any other PDF structure.
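
To make the "XML in an envelope" idea concrete, here is a minimal sketch that reads field values out of an XFA XML stream using only the JDK's XML parser. The sample XML, the form and field names (form1, Name, City), and the class name are invented for illustration; a real XFA stream is much larger and also carries template, config, and other packets.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class XfaDataReader {
    // Namespace of the XFA datasets packet.
    private static final String XFA_DATA_NS = "http://www.xfa.org/schema/xfa-data/1.0/";

    /** Collects the text of all leaf elements below the first xfa:data element. */
    public static Map<String, String> readFieldValues(byte[] xfaXml) {
        Map<String, String> values = new LinkedHashMap<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations, since the PDF may come from an untrusted source.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xfaXml));
            NodeList dataNodes = doc.getElementsByTagNameNS(XFA_DATA_NS, "data");
            if (dataNodes.getLength() > 0) {
                collectLeaves((Element) dataNodes.item(0), values);
            }
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Not a parseable XFA XML stream", e);
        }
        return values;
    }

    // Walks the tree; repeated field names overwrite earlier ones in this simple sketch.
    private static void collectLeaves(Element parent, Map<String, String> out) {
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            Element element = (Element) child;
            if (hasElementChildren(element)) {
                collectLeaves(element, out);
            } else {
                out.put(element.getLocalName(), element.getTextContent().trim());
            }
        }
    }

    private static boolean hasElementChildren(Element element) {
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String sample = "<xdp:xdp xmlns:xdp=\"http://ns.adobe.com/xdp/\">"
                + "<xfa:datasets xmlns:xfa=\"http://www.xfa.org/schema/xfa-data/1.0/\">"
                + "<xfa:data><form1><Name>Alice</Name><City>Oslo</City></form1></xfa:data>"
                + "</xfa:datasets></xdp:xdp>";
        // Prints {Name=Alice, City=Oslo}
        System.out.println(readFieldValues(sample.getBytes(StandardCharsets.UTF_8)));
    }
}
```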

Thus, many PDF processors offer only rudimentary support for XFA forms (or none at all), the main exception being (obviously) Adobe products.

As a result, XFA has been marked deprecated in the current PDF specification, ISO 32000-2 (PDF 2.0).


In the case of PDFBox, XFA support is restricted to retrieving the XFA XML data. Text extraction using the PDFTextStripper and related classes only operates on the regular PDF content and, therefore, only retrieves the text reported by the OP.

To access the content of XFA forms, you can retrieve the XFA resource using PDAcroForm.getXFA().
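
A minimal sketch of that call chain, assuming PDFBox 2.0.x on the classpath (package names and the way a document is loaded differ in other major versions). The class name, the helper method, and the 1.pdf path are placeholders, and decoding the bytes as UTF-8 is an assumption about the file at hand.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDXFAResource;

public class XfaExtractor {
    /** Returns the raw XFA XML bytes, or null if the document carries no XFA form. */
    public static byte[] extractXfa(PDDocument document) throws IOException {
        // The XFA resource hangs off the AcroForm dictionary of the document catalog.
        PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
        if (acroForm == null) {
            return null;
        }
        PDXFAResource xfa = acroForm.getXFA();
        return xfa == null ? null : xfa.getBytes();
    }

    public static void main(String[] args) throws IOException {
        // "1.pdf" is the file name from the question; adjust as needed.
        try (PDDocument document = PDDocument.load(new File("1.pdf"))) {
            byte[] xfaXml = extractXfa(document);
            if (xfaXml == null) {
                System.out.println("No XFA form found in this document.");
            } else {
                // Assumes the XML is UTF-8 encoded, which is common but not guaranteed.
                System.out.println(new String(xfaXml, StandardCharsets.UTF_8));
            }
        }
    }
}
```

The form values then have to be pulled out of the returned XML with an ordinary XML parser, since PDFBox does not interpret the XFA content itself.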
