
How to avoid PDFBox joining separate words

I'm building an application that allows searching in PDFs using Apache Solr. I was having trouble finding certain terms in the PDFs.

I noticed that words in adjacent columns were being joined together.

Example

 Column1 | Column2
 stack   | overflow

Here the PDFTextStripper would sometimes give me stackoverflow as the extracted text. This leads to bad tokenization in Solr, which prevents you from finding the term. (Yes, I know I can use wildcards, but that doesn't work in phrase queries.)

I have been looking at the sources to see what causes the problem, but it seems that the writePage method has to guess where the spaces are. I can't really change this, since that logic looks very complex.

Are there any other solutions to get a good text extraction from a pdf with columns?

  • Maybe some sort of conversion with another program.
  • Maybe a patch for PDFBox.
  • Yes, I've seen similar questions, but they mostly deal with the order of the extraction (which in my case doesn't matter that much).

I had the same problem while extracting text with PDFBox. I solved it by using the position information of each character: I took the x and y position of every character and implemented some simple logic to distinguish words. Before that, my only word delimiter was " " (space). I added one more rule: if two characters are on the same line (that is, they have the same y coordinate) and the difference between their x positions exceeds a certain value (which value is your choice), I treat them as belonging to different words. A different y coordinate always means a new word. With this logic I was able to solve the problems with table content, new lines, etc.

This link will help you get the positions of characters from a PDF with PDFBox.
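
The rule described in this answer can be sketched roughly as follows. This is only a minimal illustration, not the answerer's actual code: `GapWordSplitter`, `Glyph`, `maxGap` and `lineTolerance` are hypothetical names, and the gap is measured from the right edge of the previous character rather than between the two x positions. In PDFBox itself, the per-character data would typically come from overriding `PDFTextStripper.writeString(String, List<TextPosition>)` and reading `getUnicode()`, `getXDirAdj()`, `getYDirAdj()` and `getWidthDirAdj()` on each `TextPosition`; the sketch uses a plain stand-in class so it runs without PDFBox on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

public class GapWordSplitter {

    /** One character with its page position; a hypothetical stand-in for PDFBox's TextPosition. */
    public static final class Glyph {
        final String text;   // the character(s) this glyph represents
        final float x;       // left edge of the glyph
        final float y;       // vertical position of the glyph's line
        final float width;   // horizontal extent of the glyph

        public Glyph(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Splits glyphs (assumed to be in reading order) into words.
     * A word ends at a whitespace glyph, at a change of line, or when the
     * horizontal gap to the previous glyph on the same line exceeds maxGap.
     */
    public static List<String> splitWords(List<Glyph> glyphs, float maxGap, float lineTolerance) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;

        for (Glyph glyph : glyphs) {
            boolean isSpace = glyph.text.trim().isEmpty();
            // A different y coordinate (beyond a small tolerance) means a new line.
            boolean newLine = previous != null
                    && Math.abs(glyph.y - previous.y) > lineTolerance;
            // Same line, but the glyph starts far to the right of where the previous one ended.
            boolean wideGap = previous != null && !newLine
                    && glyph.x - (previous.x + previous.width) > maxGap;

            if ((isSpace || newLine || wideGap) && current.length() > 0) {
                words.add(current.toString());
                current.setLength(0);
            }
            if (!isSpace) {
                current.append(glyph.text);
            }
            previous = glyph;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

For the table in the question, if the glyphs of "stack" end at x = 40 and "overflow" starts at x = 80, any `maxGap` below 40 separates the two words, while normal letter spacing within a word stays well under the threshold. The right threshold depends on the font size and layout of your documents, so expect to tune it.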
