iText PDF Text Extraction with fonts and styles

I am using iText to extract text from PDFs into a String, but I have run into a problem with some PDFs. When I try to extract the text, the reader returns only blanks or destroyed (garbled) text for SOME PDFs.

Example of destroyed text:

"th isbe long to t he t est fo r extr act ion tex t"

What is the cause of this problem?

I am thinking of removing the fonts and changing the font to one that the reader can handle. I have tried researching this, but what I found has not helped me.

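This is caused by the way text is stored in the PDF file. The file just stores letters (glyphs) together with the information needed to render and position them; words and the spaces between them are not necessarily stored as such. The text extraction algorithm therefore has to guess: it finds letters that seem to be close together and, if so, puts them together. If they aren't that close, it puts in a space. When a PDF positions its letters with unusual spacing, that guess goes wrong and spaces end up in the middle of words, as in your example.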
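To make the heuristic concrete, here is a minimal sketch in plain Java. It is not iText's actual code: the class `GapSpacingSketch`, the `Glyph` type, the `joinGlyphs` method and the threshold values are all made up for illustration. It only shows how a gap-based rule can turn "this" into "th is" when two letters sit slightly further apart than the threshold allows.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of gap-based space insertion; not part of iText.
class GapSpacingSketch {

    /** One positioned glyph: the character, its left edge, and its width. */
    static final class Glyph {
        final char ch;
        final float x;
        final float width;

        Glyph(char ch, float x, float width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins glyphs that are assumed to be on one line and sorted left to right.
     * A space is inserted whenever the horizontal gap between the end of the
     * previous glyph and the start of the next one exceeds gapThreshold.
     */
    static String joinGlyphs(List<Glyph> glyphs, float gapThreshold) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                if (gap > gapThreshold) {
                    out.append(' ');
                }
            }
            out.append(g.ch);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The word "this", with a gap of 1.5 units between 'h' and 'i'.
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph('t', 0f, 5f));
        glyphs.add(new Glyph('h', 5f, 5f));
        glyphs.add(new Glyph('i', 11.5f, 3f));
        glyphs.add(new Glyph('s', 14.5f, 5f));

        System.out.println(joinGlyphs(glyphs, 1.0f)); // prints "th is"
        System.out.println(joinGlyphs(glyphs, 2.0f)); // prints "this"
    }
}
```

The same glyph positions give different text depending on the threshold, which is why the result looks fine for some PDFs and broken for others.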

I can't tell you what to do about it, though.
