
Text Extraction on a Generated PDF report in Java

I have a PDF containing the academic results of over 6,500 students. I don't have access to the actual database, so what I'm hoping to do is extract the data from this long, complex, yet fairly well-formatted document. The data will be used for analysis and visualization.

Here are the first 5 pages of this document (~1 MB).

Please help me with the following:

  1. Is it possible to extract this data? If yes, how much time would it take to write the code for that?
  2. Some tools and libraries, preferably in Java.
  3. Links to tutorials or guides.

Thanks in advance.

Is it possible to extract this data?

Yes. The PDF contains all the information required to extract the textual data from your document. Furthermore, the table columns seem to start at the same position on each page.
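To illustrate why stable column positions help: once the text has been extracted with its layout intact, the column start offsets can be guessed from the lines themselves. The following is only a minimal sketch under assumptions. It assumes a monospaced layout in which columns are separated by blanks on every row, and the class name, method name, and sample rows are made up for illustration rather than taken from the actual report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ColumnStartDetector {

    /**
     * Returns the 0-based character offsets where a column appears to start.
     * An offset counts as a column start if at least one line has a
     * non-whitespace character there and every line is blank (or too short)
     * at the offset immediately before it.
     */
    static List<Integer> detectColumnStarts(List<String> lines) {
        int width = 0;
        for (String line : lines) {
            width = Math.max(width, line.length());
        }

        // used[i] is true if any line has a non-whitespace character at offset i.
        boolean[] used = new boolean[width];
        for (String line : lines) {
            for (int i = 0; i < line.length(); i++) {
                if (!Character.isWhitespace(line.charAt(i))) {
                    used[i] = true;
                }
            }
        }

        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < width; i++) {
            if (used[i] && (i == 0 || !used[i - 1])) {
                starts.add(i);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows; the real report will look different.
        String layout = "%-6s%-14s%-4s%s";
        List<String> lines = Arrays.asList(
                String.format(layout, "1001", "ALICE SMITH", "78", "PASS"),
                String.format(layout, "1002", "BOB LEE", "65", "PASS"),
                String.format(layout, "1003", "CAROL DIAZ", "41", "FAIL"));
        System.out.println(detectColumnStarts(lines));
    }
}
```

This heuristic only works if multi-word values such as names overlap enough across rows to hide their internal spaces, so it is best run over many data rows (without headers or rule lines) and the result checked by eye.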

One way to do it is to extract the text without destroying the layout. This is quite sensible and easy for the document in question, as it was created from a pure text file to start with. One can then analyze that text on a line-by-line basis.
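The line-by-line analysis could look roughly like the sketch below. It assumes the layout-preserving text is already available as a list of lines, that every data row starts with a digit (for example a roll number) in the first column, and that the column start offsets are known. The class name, the offsets 0, 6, 20, 24, and the sample rows are hypothetical and not derived from the actual report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FixedWidthLineParser {

    /** Cuts one line into trimmed fields at the given ascending column start offsets. */
    static String[] splitAt(String line, int[] starts) {
        String[] fields = new String[starts.length];
        for (int i = 0; i < starts.length; i++) {
            // Clamp to the line length so short lines yield empty trailing fields.
            int begin = Math.min(starts[i], line.length());
            int end = (i + 1 < starts.length)
                    ? Math.min(starts[i + 1], line.length())
                    : line.length();
            fields[i] = line.substring(begin, end).trim();
        }
        return fields;
    }

    /** Keeps only lines that start with a digit and splits them into fields. */
    static List<String[]> parse(List<String> lines, int[] starts) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            // Assumption: headers, rule lines, and blank lines do not start with a digit.
            if (line.isEmpty() || !Character.isDigit(line.charAt(0))) {
                continue;
            }
            rows.add(splitAt(line, starts));
        }
        return rows;
    }

    public static void main(String[] args) {
        int[] starts = {0, 6, 20, 24}; // hypothetical column offsets
        String layout = "%-6s%-14s%-4s%s";
        List<String> lines = Arrays.asList(
                String.format(layout, "ROLL", "NAME", "MKS", "RESULT"),
                "----------------------------",
                String.format(layout, "1001", "ALICE SMITH", "78", "PASS"),
                String.format(layout, "1002", "BOB LEE", "65", "PASS"));
        for (String[] row : parse(lines, starts)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Slicing by offset rather than splitting on whitespace keeps values that contain spaces, such as full names, in a single field.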

If yes, how much time would it take to write the code for that?

That depends on the skill of the coder. Text extraction would be done using some PDF library, so only the analysis of the text remains, and in the case of your file that looks easy. A proof of concept should be possible on the first day, and all in all it should not take more than a week.

Some tools and libraries, preferably in Java.

There are multiple open-source libraries (iText, PDFBox, and PDFClown come to mind; be sure to understand the respective licensing conditions), and there are also numerous closed-source libraries that offer text extraction features.

Links to Tutorials or guides.

Tutorials / guides / samples generally can be found on the web site of the chosen library.

My advice would be to try several such libraries and check whether their text extraction output is true to the original layout, whether their performance is adequate, whether their resource requirements are acceptable, and whether their license conditions are ok for you.

(The following is the original answer, relating to the originally provided PDF, which was built to prevent text extraction.)

Is it possible to extract this data?

While your document indeed looks fairly well formatted, it strictly speaking contains no text. You might well have tried to copy and paste from a PDF viewer and been disappointed to see that nothing can be extracted.

Instead of text drawing operations (from which text usually is extracted more or less easily), your PDF uses path drawing operations, i.e. lines, curves, etc., and paints the text with them, using many operations for each single letter. This, by the way, explains the gigantic size of the file.

Thus, the text is not immediately extractable from your document. You either have to go through the content, recognize the drawing operations that create each single letter, and build the text from that; or you have to render the PDF as a bitmap and apply OCR.


 