
PDF text extraction in Java

Original 2018-07-11 08:04:30 6 2 java/ parsing/ pdf

I have a PDF file that was created with JasperReports and produced with iText (I don't know whether that is relevant). Is there an API or tool that lets me inspect its structure? I need to extract text from it.

I tried iText, PDFBox, and other Java libraries, but they only give me the text line by line, and that's not what I need.
I also tried converting the file to HTML, XML, and DOM, but the result is the same as with plain text extraction: no structure is parsed.
If I open the file as DOCX, however, Word does recognize some structure. For example, an area that looks like a table in the PDF becomes an actual table after conversion to DOCX.

I need to understand how the PDF was created, if that is possible. I know that working with PDFs is not easy, but I need something useful to start from. Thanks!

2 answers

One more option: you can also extract the text with Aspose PDF. Have a look at the link below:

https://blog.aspose.com/2018/02/28/extract-text-by-paragraphs-and-convert-files-to-pdf-with-aspose.pdf/

PDFTron PDFGenie can do full semantic table and paragraph extraction from a PDF file. It can generate a reflowable HTML file containing the appropriate HTML tags for tables and paragraphs.

See this blog post for more details: https://www.pdftron.com/blog/parsing-extraction/table-extraction-and-pdf-to-xml-with-pdfgenie/#a-idpart7aevaluating-accuracy-of-pdf-table-recognition

You can download the PDFGenie command-line tool for Windows, macOS, and Linux here: https://www.pdftron.com/downloads/linux
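
If you would rather stay with iText or PDFBox: an untagged PDF usually stores only positioned text, not tables, so a converter has to infer rows and columns from coordinates. Both libraries can report where each piece of text sits on the page, and a rough table can then be rebuilt by clustering those pieces. Below is a minimal sketch of that idea in plain Java, not the method any of the tools above actually use. The `Chunk` class, `groupIntoRows`, and the tolerance value are hypothetical names chosen for illustration, and it assumes you have already collected the chunks from your library of choice with y growing downward.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class RowGrouper {

    /** Hypothetical holder for one positioned piece of text (not a library class). */
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups chunks into rows. A chunk joins the current row when its y
     * is within `tolerance` of the first chunk of that row; otherwise it
     * starts a new row. Rows come out top to bottom (assumption: y grows
     * downward), cells left to right.
     */
    static List<List<String>> groupIntoRows(List<Chunk> chunks, double tolerance) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y));

        List<List<Chunk>> rows = new ArrayList<>();
        double rowY = 0;
        for (Chunk c : sorted) {
            if (rows.isEmpty() || Math.abs(c.y - rowY) > tolerance) {
                rows.add(new ArrayList<>());
                rowY = c.y; // y of the first chunk anchors the row
            }
            rows.get(rows.size() - 1).add(c);
        }

        List<List<String>> result = new ArrayList<>();
        for (List<Chunk> row : rows) {
            row.sort(Comparator.comparingDouble((Chunk c) -> c.x));
            List<String> cells = new ArrayList<>();
            for (Chunk c : row) {
                cells.add(c.text);
            }
            result.add(cells);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up coordinates for a 2x2 table, deliberately out of order.
        List<Chunk> chunks = Arrays.asList(
                new Chunk("Price", 200, 100.4),
                new Chunk("Item", 50, 100.0),
                new Chunk("9.99", 200, 120.0),
                new Chunk("Pen", 50, 119.8));

        System.out.println(groupIntoRows(chunks, 2.0));
    }
}
```

Real documents need more than this (column detection, cells that wrap over several lines, merged cells), which is where the dedicated tools mentioned in the answers earn their keep.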