
Identify and extract tables from a PDF using Java

Original 2017-03-31 10:30:01 6 2 pdf/ itext/ pdfbox/ java

I have different types of PDFs that contain a mix of content such as text and tables. A table may appear anywhere on a page (top, middle, or bottom). I want to extract only the table data (number of columns, number of rows, and the cell data) from such a PDF using Java, without passing in the table's location.

What I have done so far:

1. I used the iText Java API to read and extract the content, with the following call:

PdfTextExtractor.getTextFromPage

But it only returns the data as plain text. I found no clue as to how to identify where a table exists in the PDF or how to extract data from that table.

2. I also tried the PDFBox Java API, but it did not solve my problem either.

3. I also followed this Stack Overflow link: PDF table extraction. But it does not give me the expected output. That algorithm needs the line positions and so on as input.

I cannot work out how to locate the table in the PDF.

Can anybody tell me how to solve this problem using the iText and PDFBox APIs, or is there an open-source API that can help me solve it?

Or can we convert the PDF into HTML, so that we can identify and read the table through its table tags? ;)

2 answers

You can try Tabula, an open-source tool for detecting and extracting tables from PDF documents. You can extend tabula-java to extract the table details. More can be found here.

If you also want to extract text from the document, you can use PDFBox or Apache Tika for text-only extraction.

It basically depends on your input document and on how much effort you are willing to put into this project.

A PDF does not work like an HTML document. In HTML documents you have logical tags like "table" or "paragraph". A PDF document (in the most basic case) contains only the instructions needed to render the document. So instead of getting "table" you might get "draw a line here, and another one a bit further away, and then another one that crosses both, and so on".

Also, according to the PDF specification, these instructions do not even have to appear in logical (reading) order.
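One consequence is that text pulled from a page may need to be re-sorted by position before rows and columns make sense. Here is a minimal sketch of that idea, assuming each text chunk arrives with the x/y coordinates of where it starts. The Chunk class and the toLines method are hypothetical names made up for illustration; they are not part of iText or PDFBox.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: rebuilds reading order from positioned text chunks.
class ReadingOrderSketch {

    /** A piece of text and the page coordinates where it starts. */
    static final class Chunk {
        final String text;
        final double x, y;
        Chunk(String text, double x, double y) {
            this.text = text; this.x = x; this.y = y;
        }
    }

    /** Groups chunks into lines (top to bottom), each sorted left to right. */
    static List<String> toLines(List<Chunk> chunks, double lineTol) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // Default PDF user space has y growing upward, so larger y comes first.
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y));

        List<List<Chunk>> lines = new ArrayList<>();
        for (Chunk c : sorted) {
            List<Chunk> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            // Same line if y is within lineTol of the line's first chunk.
            if (last != null && Math.abs(last.get(0).y - c.y) <= lineTol) {
                last.add(c);
            } else {
                List<Chunk> line = new ArrayList<>();
                line.add(c);
                lines.add(line);
            }
        }

        List<String> out = new ArrayList<>();
        for (List<Chunk> line : lines) {
            line.sort(Comparator.comparingDouble((Chunk c) -> c.x));
            StringBuilder sb = new StringBuilder();
            for (Chunk c : line) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(c.text);
            }
            out.add(sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>();
        // Deliberately out of reading order, as a content stream may be.
        chunks.add(new Chunk("Price", 200, 700));
        chunks.add(new Chunk("1.50", 200, 680));
        chunks.add(new Chunk("Name", 50, 700));
        chunks.add(new Chunk("Apple", 50, 680));
        for (String line : toLines(chunks, 2.0)) {
            System.out.println(line);
        }
    }
}
```

The tolerance is a guess that would need tuning per document; rotated text and multi-column layouts are not handled here.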

If you are lucky, your input PDF might be a tagged PDF. Tagged PDFs contain an internal representation of the document's underlying structure. A tagged PDF might be able to tell you exactly which objects in the document make up the table.

Now, to get back to an actual answer. If you want a solution that always works, you can implement the iText7 IEventListener interface. It has a method eventOccurred() that is called every time the parser has finished dealing with an object (such as a piece of text, a line, etc.).

If you then look out for lines and build some heuristic to determine when a collection of lines constitutes a table, you should be able to detect tables.
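To make the heuristic idea concrete, here is a rough sketch in plain Java. It assumes the line segments have already been collected from the page (for example by a listener like the one described above) and that the page holds a single, fully ruled, axis-aligned table. The TableGridSketch class, its Segment type, and detectGrid are hypothetical names for illustration, not part of iText or PDFBox. It only counts distinct horizontal and vertical rulings; it does not check that they actually cross, and it cannot separate several tables on one page.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: guesses a table grid from axis-aligned line segments.
class TableGridSketch {

    /** A straight line segment in page coordinates. */
    static final class Segment {
        final double x1, y1, x2, y2;
        Segment(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
    }

    /** Sorts the values and collapses runs that lie within tol into one. */
    static List<Double> distinct(List<Double> values, double tol) {
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        List<Double> out = new ArrayList<>();
        for (double v : sorted) {
            if (out.isEmpty() || v - out.get(out.size() - 1) > tol) {
                out.add(v);
            }
        }
        return out;
    }

    /**
     * Returns {rows, columns} if the segments look like a ruled grid,
     * or null if there are fewer than two rulings in either direction.
     */
    static int[] detectGrid(List<Segment> segments, double tol) {
        List<Double> ys = new ArrayList<>(); // y of each horizontal segment
        List<Double> xs = new ArrayList<>(); // x of each vertical segment
        for (Segment s : segments) {
            if (Math.abs(s.y1 - s.y2) <= tol) {
                ys.add(s.y1);
            } else if (Math.abs(s.x1 - s.x2) <= tol) {
                xs.add(s.x1);
            }
            // Diagonal segments are ignored.
        }
        List<Double> horizontal = distinct(ys, tol);
        List<Double> vertical = distinct(xs, tol);
        if (horizontal.size() < 2 || vertical.size() < 2) {
            return null;
        }
        // n rulings enclose n - 1 rows (or columns).
        return new int[] { horizontal.size() - 1, vertical.size() - 1 };
    }

    public static void main(String[] args) {
        List<Segment> segments = new ArrayList<>();
        // Three horizontal rulings at y = 100, 120, 140.
        for (double y : new double[] {100, 120, 140}) {
            segments.add(new Segment(50, y, 350, y));
        }
        // Four vertical rulings at x = 50, 150, 250, 350.
        for (double x : new double[] {50, 150, 250, 350}) {
            segments.add(new Segment(x, 100, x, 140));
        }
        int[] grid = detectGrid(segments, 1.0);
        System.out.println(grid[0] + " rows, " + grid[1] + " columns");
        // prints "2 rows, 3 columns"
    }
}
```

Tables drawn without rulings (aligned text only) would need a different heuristic, such as clustering text positions into columns.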

iText also plans to release a pdf2Data add-on, which will basically do the heavy lifting for you.