iText reading multicolumned PDF document

I use iText to read a PDF, extracting a page's content into a string variable and then cleaning it up like this:

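// Open the PDF bundled as a raw Android resource and extract the text of page 2.
reader = new PdfReader(getResources().openRawResource(R.raw.resume1));
original_content = PdfTextExtractor.getTextFromPage(reader, 2);
// Collapse repeated spaces, then drop a space that follows a line break.
String sub_content = original_content.trim().replaceAll(" {2,}", " ");
sub_content = sub_content.trim().replaceAll("\n ", "\n");
// Join lines that do not end in a period with the following line.
sub_content = sub_content.replaceAll("(.+)(?<!\\.)\n(?!\\W)", "$1 ");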

This works if the document has only one column. If the document has multiple columns, however, the text is extracted line by line across the full page width, so each extracted line combines the left and right columns.
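For reference, the three replaceAll calls can be tried on a plain string without iText or a PDF. This is only a small sketch: LineCleanup, clean and the sample text are made up for illustration, and the regular expressions are copied unchanged from the snippet above.

```java
final class LineCleanup {
    /** Applies the three replacements from the question to an already extracted string. */
    static String clean(String extracted) {
        // Runs of two or more spaces become a single space.
        String s = extracted.trim().replaceAll(" {2,}", " ");
        // A single space directly after a line break is removed.
        s = s.trim().replaceAll("\n ", "\n");
        // A line break is turned into a space unless the line ends in '.'
        // or the next line starts with a non-word character.
        return s.replaceAll("(.+)(?<!\\.)\n(?!\\W)", "$1 ");
    }

    public static void main(String[] args) {
        String sample = "First line of a\nparagraph.\nSecond   paragraph\n continues here.";
        // Prints two lines:
        // First line of a paragraph.
        // Second paragraph continues here.
        System.out.println(clean(sample));
    }
}
```

The cleanup only repairs line breaks inside one stream of text, so it cannot separate columns that were already merged during extraction.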

I am using this as a sample PDF; it is from the START QA document.

How can I read a multicolumned PDF document?

There are two different approaches to this problem, and which one to use depends on the PDF itself.

  1. If the strings in the page content of the PDF in question are already in the desired order: Instead of the LocationTextExtractionStrategy implicitly used by the overload of PdfTextExtractor.getTextFromPage you call, explicitly use the SimpleTextExtractionStrategy; in your case:

     original_content = PdfTextExtractor.getTextFromPage(reader, 2, new SimpleTextExtractionStrategy());

  2. If the strings in the page content of the PDF in question are not in the desired order: Keep the LocationTextExtractionStrategy, but explicitly wrap one such strategy in a FilteredTextRenderListener that restricts it to text in the area of a single column, and extract each column separately; in your case:

     // Left and right halves of a 612 x 792 page: (llx, lly, urx, ury)
     Rectangle left = new Rectangle(0, 0, 306, 792);
     Rectangle right = new Rectangle(306, 0, 612, 792);
     RenderFilter leftFilter = new RegionTextRenderFilter(left);
     RenderFilter rightFilter = new RegionTextRenderFilter(right);
     [...]
     // Extract the left column first, then append the right column.
     TextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), leftFilter);
     original_content = PdfTextExtractor.getTextFromPage(reader, 2, strategy);
     original_content += " ";
     strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), rightFilter);
     original_content += PdfTextExtractor.getTextFromPage(reader, 2, strategy);
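The rectangles in the second approach hard-code the two halves of a 612 x 792 page. As a rough sketch, assuming equally wide columns with no margins or gutters, the same bounds could be computed for other page sizes and column counts. ColumnBounds and split are hypothetical names, not part of iText; each returned array holds its values in the order used by the Rectangle constructor above (llx, lly, urx, ury).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ColumnBounds {
    /**
     * Splits a page into equal-width vertical strips, ordered left to right.
     * Each array is {llx, lly, urx, ury}. Assumes no margins or gutters.
     */
    static List<float[]> split(float pageWidth, float pageHeight, int columns) {
        if (columns < 1) {
            throw new IllegalArgumentException("columns must be >= 1");
        }
        List<float[]> boxes = new ArrayList<>();
        float columnWidth = pageWidth / columns;
        for (int i = 0; i < columns; i++) {
            boxes.add(new float[] {i * columnWidth, 0f, (i + 1) * columnWidth, pageHeight});
        }
        return boxes;
    }

    public static void main(String[] args) {
        // Prints [0.0, 0.0, 306.0, 792.0] and then [306.0, 0.0, 612.0, 792.0]
        for (float[] box : split(612f, 792f, 2)) {
            System.out.println(Arrays.toString(box));
        }
    }
}
```

Each array could then be passed to new Rectangle(...) and wrapped in a RegionTextRenderFilter as shown in the answer. Real layouts often have uneven columns or full-width headers, so the computed bounds may still need adjusting by hand.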
