I'm making an application which allows searching in pdf's using apache Solr. I was having trouble finding certain terms in pdfs.
I noticed words in columns got appended.
Example
Column1 | Column2
stack | overflow
Here the PdftextStripper would sometimes give me stackoverflow as extracted text. This would lead to bad tokinazation in solr which prevents you from finding the term. (Yes I know I can use wildcards but that doesn't work in phrase queries)
I have been looking at the sources to see what causes the problem. But it seems that the writePage method has to guess the spaces. I can't really change this since it seems very complex.
Are there any other solutions to get a good text extraction from a pdf with columns?
I got the same problem while extracting text with PDFbox. I solved this issue by taking the position information of each character. I took x position and y position of each character. And implemented a simple logic to distinguish words. Before that my word delimitter was only the " "(space). I added one more logic that if the difference of the X position of two characters are beyond a certain value (this value will be your choice.) and it is in the same line, that is same y coordinate (Different y coordinate means certainly a new word), I treated them as a new word. With this logic I was able to solve problems with table content, new line etc.
This link will help you to get the position of characters from pdf with PDFbox.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.