
Text extraction from a PDF image file

I have an image file and want to extract text from it. I tried various OCR engines, but I cannot recover the relationship between the left-side entity and the right-side entity, because an OCR engine simply extracts text without any relationship between entities. For example: Transaction (Company borrow money), account#1: Cash, account#2: Loan payable.

I have tried text extraction using various OCR engines, PyPDF2, and pdftotext. I have attached an image file from which I am trying to extract text and find the relationship between the left-side entity and the right-side entity.

  • Are all the images to be analyzed like that?
  • Does that example reflect the reality of the images you'll be analyzing?
  • Will the limits of each column always be in the same position?

Since you didn't specify this, I'm going to assume yes for all.

The main problem is that, once you have the OCR string, you can't tell whether a given space separates two words or two columns.

To solve this, crop the image to each column and run OCR on each column individually, so that you end up with 3 strings, one per column.
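A minimal sketch of the cropping step, assuming Pillow and pytesseract are installed and that the x-coordinates of the column edges are known in advance. The function names `column_boxes` and `ocr_columns`, the `boundaries` parameter, and the `--psm 6` setting are illustrative choices, not something the original answer specifies.

```python
def column_boxes(width, height, boundaries):
    """Turn x-coordinates of column edges into PIL-style crop boxes.

    `boundaries` lists every vertical edge from left to right, e.g.
    [0, 400, 700, 1000] describes three columns (hypothetical values).
    """
    edges = sorted(boundaries)
    if len(edges) < 2 or edges[0] < 0 or edges[-1] > width:
        raise ValueError("column edges must lie inside the image")
    # Each box is (left, upper, right, lower) and spans the full height.
    return [(left, 0, right, height) for left, right in zip(edges, edges[1:])]


def ocr_columns(image_path, boundaries):
    """Crop each column and OCR it on its own; returns one string per column."""
    # Imported lazily so column_boxes() works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    image = Image.open(image_path)
    width, height = image.size
    texts = []
    for box in column_boxes(width, height, boundaries):
        # --psm 6 asks Tesseract to treat the crop as one uniform block of text.
        texts.append(pytesseract.image_to_string(image.crop(box), config="--psm 6"))
    return texts
```

Calling `ocr_columns("page.png", [0, 400, 700, 1000])` would then return three strings, one per column; the file name and pixel positions here are placeholders to be measured on the real image.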

Split each string on '\n'; you should end up with 3 arrays containing the lines of each column.

Compare the sizes of the arrays. If any of the 3 has a different size, the extraction failed and you should retry or clean up the image.
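The split and the size check could look roughly like this. The helper names `split_lines` and `check_same_length` are made up for illustration, and stripping the trailing newline and form feed is an assumption about what the OCR engine appends to its output.

```python
def split_lines(column_text):
    """Split one column's OCR text on newlines, keeping blank lines inside."""
    # Drop only trailing newlines / form feeds so interior blank lines survive;
    # those blank lines are what marks a multi-line entry later on.
    return column_text.rstrip("\n\f").split("\n")


def check_same_length(columns):
    """Return the common line count, or raise if the columns disagree."""
    lengths = [len(lines) for lines in columns]
    if len(set(lengths)) != 1:
        # A mismatch suggests the OCR dropped or invented a line somewhere.
        raise ValueError(f"column line counts differ: {lengths}")
    return lengths[0]
```

If `check_same_length` raises, that is the signal to retry the OCR or clean up the image before going any further.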

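Iterate over the elements of the second and/or third array and look for elements that are empty (a line that was just a "\n" before the split). Assuming these columns can't have empty fields, an empty line there must mean that the field in the first column spans 2 or more lines. Remove that element from the second and third arrays, and in the first array join the element at the same position with the adjacent one (the next one, or the previous one if the right-hand text sits on the first line of the entry), so that the multi-line entry becomes a single element.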

If all three arrays have the same number of elements and you have joined the entries that span more than one line, you're good to go: the relationship is given by the position in the arrays.
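Putting the last two steps together, here is a rough sketch of the merge. It assumes the right-hand text sits on the first line of a wrapped entry, so a row whose second and third fields are both blank is folded into the previous row; if the layout puts it on the last line instead, the join direction would need to be flipped. The name `merge_wrapped_rows` is hypothetical.

```python
def merge_wrapped_rows(first, second, third):
    """Pair up the three columns by position, folding wrapped lines together."""
    rows = []
    for left, middle, right in zip(first, second, third):
        # A blank second and third field means the first column wrapped.
        continuation = not middle.strip() and not right.strip()
        if continuation and rows:
            prev_left, prev_middle, prev_right = rows[-1]
            joined = f"{prev_left} {left.strip()}".strip()
            rows[-1] = (joined, prev_middle, prev_right)
        else:
            rows.append((left.strip(), middle.strip(), right.strip()))
    return rows


# Illustrative input: the first entry comes from the question, the second
# one is invented sample data.
rows = merge_wrapped_rows(
    ["Company borrow", "money", "Owner invests"],
    ["Cash", "", "Cash"],
    ["Loan payable", "", "Capital"],
)
```

Each tuple in `rows` then holds one transaction together with its two accounts, which is the left-to-right relationship the question asks for.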
