
Text extraction from a PDF image file

Posted 2019-08-27 14:05:03. Tags: python, image, ocr, text-extraction

I have an image file and I want to extract text from it. I have tried various OCR engines, but I cannot recover the relationship between the left-side entity and the right-side entity, because an OCR engine simply extracts text without preserving the relationship between entities. For example: Transaction (Company borrow money), account#1: Cash, account#2: Loan payable.

I have tried text extraction using various OCR engines, as well as PyPDF2 and pdftotext. I have attached an image file from which I am trying to extract text and recover the relationship between the left-side and right-side entities.

1 answer

Are all the images to be analyzed like that?
Does that example reflect the reality of the images you'll be analyzing?
Will the boundaries of each column always be in the same position?

Since you didn't specify this, I'm going to assume the answer is yes to all three.

The main problem is that once you have the OCR string, you can't tell whether a space separates two words or two columns.

To solve this, crop the image into one piece per column and run OCR on each column individually, so you end up with 3 strings, one for each column.
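The cropping step can be sketched as follows. This is only an illustration under assumptions: the helper names `column_boxes` and `ocr_columns` are hypothetical, the column separators are assumed to be known x-coordinates (which relies on the columns always being in the same position), and the OCR engine is passed in as a plain callable so the sketch does not depend on any particular library.

```python
from typing import Callable, List, Sequence, Tuple

# (left, upper, right, lower), the box layout used by Pillow's Image.crop
Box = Tuple[int, int, int, int]


def column_boxes(width: int, height: int, boundaries: Sequence[int]) -> List[Box]:
    """Turn the x-coordinates of the column separators into one crop box per column."""
    edges = [0, *sorted(boundaries), width]
    return [(left, 0, right, height) for left, right in zip(edges, edges[1:])]


def ocr_columns(image, boundaries: Sequence[int], ocr: Callable[[object], str]) -> List[str]:
    """Crop `image` into columns and run `ocr` on each crop separately.

    `image` only needs a `.size` attribute and a `.crop(box)` method,
    which is what a Pillow image provides.
    """
    width, height = image.size
    return [ocr(image.crop(box)) for box in column_boxes(width, height, boundaries)]


# Hypothetical usage with Pillow and pytesseract (both need to be installed;
# the file name and the separator positions 400 and 700 are made up):
#   from PIL import Image
#   import pytesseract
#   page = Image.open("transactions.png")
#   columns = ocr_columns(page, [400, 700], pytesseract.image_to_string)
```

With two separators this yields three strings, one per column, which is the input the following steps expect.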

Split each string on '\n'; you should then have 3 arrays, each containing the lines of one column.

Compare the sizes of the arrays: if any of the 3 has a different size, the extraction failed and you should clean up the image and retry.
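A minimal sketch of the split-and-compare step, assuming the three OCR strings are already available. The function name `split_columns` is hypothetical, and dropping a single trailing newline is my assumption about typical OCR output, not something stated above.

```python
from typing import List


def split_columns(column_texts: List[str]) -> List[List[str]]:
    """Split each column's OCR text into lines and check that the line counts match."""
    columns = []
    for text in column_texts:
        # Assumption: drop one trailing newline so it does not create a
        # spurious empty last line.
        if text.endswith("\n"):
            text = text[:-1]
        columns.append(text.split("\n"))

    sizes = [len(lines) for lines in columns]
    if len(set(sizes)) > 1:
        # Different line counts mean the extraction failed for at least one column.
        raise ValueError(f"column line counts differ: {sizes}")
    return columns
```

Raising an exception here makes the "retry / clean up the image" decision explicit for the caller.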

Iterate over the elements of the second and/or third array and look for elements that are empty (a line that was just a "\n"). Assuming you can't have empty fields here, an empty line must mean that the field in the first column spans 2 or more lines, so remove this element from the second and third arrays and, in the first array, join this element with the next one.
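The joining step might look like the sketch below. It assumes, as the description implies, that the empty lines in the second and third columns sit above the line that carries their values, and that a wrapped line is empty in both of those columns. The name `merge_wrapped_rows` and the sample data are made up for illustration.

```python
from typing import List, Tuple


def merge_wrapped_rows(
    first: List[str], second: List[str], third: List[str]
) -> Tuple[List[str], List[str], List[str]]:
    """Join first-column fields that span several lines and drop the
    matching empty entries from the second and third columns."""
    out_first: List[str] = []
    out_second: List[str] = []
    out_third: List[str] = []
    pending: List[str] = []  # pieces of a first-column field still waiting for its row

    for a, b, c in zip(first, second, third):
        pending.append(a.strip())
        if b.strip() == "" and c.strip() == "":
            continue  # wrapped line: keep collecting first-column text
        out_first.append(" ".join(piece for piece in pending if piece))
        out_second.append(b.strip())
        out_third.append(c.strip())
        pending = []

    if pending:
        # The columns ended on a wrapped line with no values to attach it to.
        raise ValueError("trailing first-column text has no matching row")
    return out_first, out_second, out_third


rows = merge_wrapped_rows(
    ["Company borrows", "money", "Owner invests"],
    ["", "Cash", "Cash"],
    ["", "Loan payable", "Capital"],
)
```

Collecting pieces in `pending` also covers fields that wrap over three or more lines, since every consecutive empty line just adds another piece.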

If all three arrays have the same number of elements, and you have joined the entries that span more than one line, you're good to go: the relationship is given by position, so the i-th elements of the three arrays belong together.
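Once the arrays are aligned, pairing by position can be sketched like this. The function name `to_records` and the dictionary keys are hypothetical, loosely based on the "Transaction / account#1 / account#2" example in the question.

```python
from typing import Dict, List


def to_records(
    transactions: List[str], accounts_1: List[str], accounts_2: List[str]
) -> List[Dict[str, str]]:
    """Combine three aligned columns into one record per row."""
    if not (len(transactions) == len(accounts_1) == len(accounts_2)):
        raise ValueError("columns are not aligned")
    return [
        {"transaction": t, "account_1": a1, "account_2": a2}
        for t, a1, a2 in zip(transactions, accounts_1, accounts_2)
    ]


records = to_records(["Company borrow money"], ["Cash"], ["Loan payable"])
```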