
Text Extraction on a Generated PDF Report in Java

Posted 2014-08-04 14:08:46 · Tags: java, pdf, text-extraction

I have a PDF of the academic results of over 6500 students. I don't have access to the actual database; what I'm hoping to do is extract the data from this long, complex, yet fairly well-formatted document. The data will be used for analysis and visualization.

Here are the first 5 pages of this document (~1 MB).

Please help me with:

Is it possible to extract this data? If yes, how much time would it take to write code for that?
Some tools and libraries, preferably in Java.
Links to tutorials or guides.

Thanks in advance.

1 Answer

Is it possible to extract this data?

Yes. The PDF contains all the information required to extract textual data from your document. Furthermore, the table columns seem to start at the same respective position on each page.

One way to do it is to extract the text without destroying the layout. This is quite sensible and easy for the document in question, as it was created from a pure text file to start with. One can then analyze that text on a line-by-line basis.
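To illustrate the line-by-line analysis step, here is a minimal sketch in plain Java. It assumes the layout-preserving text has already been extracted by whichever PDF library you choose and is available as a list of lines. The class name, the column offsets (0, 10, 40), and the idea that a data row starts with an all-digit roll number are hypothetical, picked only for illustration; the real values would have to be read off the extracted text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: the column offsets and the "row starts with digits" rule are
// assumptions for illustration, not taken from the actual document.
class FixedWidthRows {

    // Assumed start offset of each column, in characters from the line start.
    static final int[] COLUMN_STARTS = {0, 10, 40};

    // Cuts one extracted line into trimmed column values at the given offsets.
    static String[] splitColumns(String line, int[] starts) {
        String[] cells = new String[starts.length];
        for (int i = 0; i < starts.length; i++) {
            int from = Math.min(starts[i], line.length());
            int to = (i + 1 < starts.length)
                    ? Math.min(starts[i + 1], line.length())
                    : line.length();
            cells[i] = line.substring(from, to).trim();
        }
        return cells;
    }

    // Keeps only lines whose first column is all digits (assumed roll number),
    // which drops headers, blank lines, and page furniture.
    static List<String[]> parse(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            String[] cells = splitColumns(line, COLUMN_STARTS);
            if (!cells[0].isEmpty()
                    && cells[0].chars().allMatch(Character::isDigit)) {
                rows.add(cells);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample lines laid out on the assumed column grid.
        List<String> lines = Arrays.asList(
                String.format("%-10s%-30s%s", "ROLL NO", "NAME", "MARKS"),
                String.format("%-10s%-30s%s", "1001", "STUDENT A", "78"),
                String.format("%-10s%-30s%s", "1002", "STUDENT B", "64"));
        for (String[] row : parse(lines)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Because the columns start at the same position on every page, the same offsets can be reused for the whole document; only the row filter needs to be adapted to whatever distinguishes data lines from headers in the real file.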

If yes, how much time would it take to write code for that?

That depends on the skill of the coder. Text extraction would be done using some PDF library, so only the analysis of the text remains, and in the case of your file that looks easy. A proof of concept should be possible on the first day, and all in all it should not take more than a week.

Some tools and libraries, preferably in Java.

There are multiple open-source libraries (iText, PDFBox, and PDFClown come to mind; be sure to understand the respective licensing conditions), and there are also numerous closed-source libraries that offer text extraction features.

Links to tutorials or guides.

Tutorials, guides, and samples can generally be found on the website of the chosen library.

My advice would be to try several such libraries and check whether their text extraction output is true to the original layout, whether their performance is adequate, whether their resource requirements are acceptable, and whether their license conditions are OK for you.

(The following is the original answer, relating to the originally provided PDF, which was built to prevent text extraction.)

Is it possible to extract this data?

While your document indeed looks fairly well formatted, it strictly speaking contains no text. You might well have tried to copy and paste from a PDF viewer and been disappointed to see that nothing can be extracted.

Instead of text drawing operations (from which text can usually be extracted more or less easily), your PDF uses path drawing operations, i.e. lines, curves, etc., and paints the text with them, using many operations for each single letter. This, by the way, explains the gigantic size of the file.

Thus, the text is not immediately extractable from your document. You either have to go through the content, recognize the drawing operations that create a single letter, and build text from that; or you have to render the PDF as a bitmap and apply OCR.
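To check quickly whether a page is of this "drawn, not typed" kind, one could count the operators in its content stream. The sketch below is a rough illustration in plain Java and not a real PDF parser. It assumes the page's content stream has already been obtained in decompressed form as a String (most PDF libraries can provide that). The operator names are the standard PDF text-showing operators (Tj, TJ, ', ") and path construction operators (m, l, c, v, y, h, re); the class and method names are made up for this example. It does not handle inline images, comments, or content nested in form XObjects.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Rough probe, not a full parser: counts text-showing versus path
// construction operators in an already decompressed content stream.
class ContentStreamProbe {

    static final Set<String> TEXT_SHOWING =
            new HashSet<>(Arrays.asList("Tj", "TJ", "'", "\""));
    static final Set<String> PATH_CONSTRUCTION =
            new HashSet<>(Arrays.asList("m", "l", "c", "v", "y", "h", "re"));

    // Returns {number of text-showing operators, number of path operators}.
    static int[] countOperators(String content) {
        int textOps = 0;
        int pathOps = 0;
        int i = 0;
        final int n = content.length();
        while (i < n) {
            char ch = content.charAt(i);
            if (ch == '(') {
                i = skipLiteralString(content, i);   // string operand, not an operator
            } else if (ch == '<') {
                if (i + 1 < n && content.charAt(i + 1) == '<') {
                    i += 2;                          // "<<" opens a dictionary
                } else {
                    int close = content.indexOf('>', i);
                    i = (close < 0) ? n : close + 1; // skip a hex string <...>
                }
            } else if (isSeparator(ch)) {
                i++;
            } else {
                int start = i;
                while (i < n && !isSeparator(content.charAt(i))
                        && content.charAt(i) != '(') {
                    i++;
                }
                String token = content.substring(start, i);
                if (TEXT_SHOWING.contains(token)) {
                    textOps++;
                } else if (PATH_CONSTRUCTION.contains(token)) {
                    pathOps++;
                }
            }
        }
        return new int[] {textOps, pathOps};
    }

    static boolean isSeparator(char ch) {
        return Character.isWhitespace(ch)
                || ch == '[' || ch == ']' || ch == '<' || ch == '>';
    }

    // Skips a literal string starting at '(' and returns the index just after
    // its closing ')', honoring nested parentheses and backslash escapes.
    static int skipLiteralString(String s, int open) {
        int depth = 0;
        int i = open;
        while (i < s.length()) {
            char ch = s.charAt(i);
            if (ch == '\\') {
                i += 2;                              // skip the escaped character
                continue;
            }
            if (ch == '(') {
                depth++;
            } else if (ch == ')') {
                depth--;
                if (depth == 0) {
                    return i + 1;
                }
            }
            i++;
        }
        return s.length();                           // unterminated string
    }

    public static void main(String[] args) {
        int[] counts = countOperators("BT /F1 12 Tf 72 700 Td (Result) Tj ET");
        System.out.println("text-showing: " + counts[0]
                + ", path construction: " + counts[1]);
    }
}
```

A page with thousands of path operators and no text-showing operators is a strong hint that the glyphs were painted as outlines, in which case ordinary text extraction will return nothing and glyph recognition or OCR is the remaining route.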