
Measuring image processing quality for tesseract ocr

I'm testing various Python image pre-processing pipelines for tesseract-ocr.

My input data are PDF invoices and receipts of widely varying quality, from scanned documents (best) to photos taken on mobile phones in poor lighting (worst), and everything in between. When scanning manually for OCR, I typically choose among several scanning presets (unsharp mask, edge fill, color enhance, gamma). I'm thinking about implementing a similar solution in a Python pipeline.
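For illustration, this is the sort of pipeline I have in mind (a minimal sketch assuming OpenCV and NumPy; the step implementations and parameter values are placeholder choices of mine, not actual scanner presets):

    import cv2
    import numpy as np

    def unsharp_mask(gray, sigma=1.0, amount=1.5):
        # Sharpen by subtracting a Gaussian-blurred copy from the original.
        blurred = cv2.GaussianBlur(gray, (0, 0), sigma)
        return cv2.addWeighted(gray, 1 + amount, blurred, -amount, 0)

    def gamma_correct(gray, gamma=0.8):
        # Apply a gamma curve through a 256-entry lookup table.
        table = np.array([(i / 255.0) ** gamma * 255 for i in range(256)],
                         dtype=np.uint8)
        return cv2.LUT(gray, table)

    def preprocess(path, steps):
        # Load as grayscale and apply the chosen steps in order.
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        for step in steps:
            gray = step(gray)
        return gray

    # e.g. a "sharpen, then lift midtones" preset:
    # processed = preprocess("invoice.png", [unsharp_mask, gamma_correct])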

I understand the standard metric for OCR quality is the Levenshtein (edit) distance, which measures the quality of results against the ground truth.
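For reference, a minimal pure-Python version of that metric (in practice I'd use a library such as python-Levenshtein or rapidfuzz); dividing by the ground-truth length gives a character error rate that is comparable across documents:

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance (insert/delete/substitute).
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    # cer = levenshtein(ocr_text, ground_truth) / max(1, len(ground_truth))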

What I'm after are measurements of the effect of image processing on OCR result quality. For example, in the paper Prediction of OCR Accuracy the author describes at least two such measurements, White Speckle Factor (WSF) and Broken Character Factor (BCF). Other descriptors I've read about include salt-and-pepper noise and aberrant pixels.
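A crude proxy in that spirit, and only an illustration assuming OpenCV rather than the exact WSF formula from the paper, would be to binarise the page and count how many connected components are just a few pixels in size:

    import cv2
    import numpy as np

    def speckle_ratio(gray, max_speckle_area=3):
        # Fraction of foreground components no larger than max_speckle_area pixels.
        _, binary = cv2.threshold(gray, 0, 255,
                                  cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        n, _, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
        areas = stats[1:, cv2.CC_STAT_AREA]  # skip the background label
        if len(areas) == 0:
            return 0.0
        return float(np.sum(areas <= max_speckle_area)) / len(areas)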

I've worked my way through 200 of the nearly 4k tesseract-tagged questions here. Very interesting. Most questions are of the type "I have this kind of image, how can I improve the OCR outcome?" So far I've found nothing about measuring the effect of image processing on OCR outcomes.

A curious one was Dirty Image Quality Assesment Measure, but that question is not focused on OCR and the solutions seem like overkill.

There is no universal image improvement technique for OCR-ability. Every image defect is (partly) corrected with ad-hoc techniques, and a technique that works in one case can be counter-productive in another.

For a homogeneous data set (in the sense that all documents have a similar origin/quality and were captured under the same conditions), you can indeed optimize the preprocessing chain by trying different combinations and settings and computing the total edit distance. But this requires prior knowledge of the ground truth (at least for a sample of the documents).
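As a rough sketch of that brute-force search, reusing the preprocess and levenshtein sketches from the question plus pytesseract, with samples standing in for a hypothetical list of (image_path, ground_truth) pairs:

    from itertools import permutations
    import pytesseract

    def best_chain(samples, candidate_steps):
        # Try every subset and ordering of the candidate steps; keep the chain
        # with the lowest total edit distance against the ground truth.
        best_score, best_steps = float("inf"), ()
        for k in range(len(candidate_steps) + 1):
            for chain in permutations(candidate_steps, k):
                total = 0
                for path, truth in samples:
                    text = pytesseract.image_to_string(preprocess(path, chain))
                    total += levenshtein(text, truth)
                if total < best_score:
                    best_score, best_steps = total, chain
        return best_score, best_steps

    # score, chain = best_chain(samples, [unsharp_mask, gamma_correct])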

But for heterogeneous data sets, there is little that you can do. There remains the option of testing different preprocessing chains and relying on the recognition scores returned by the OCR engine, assuming that better readability corresponds to better correctness.
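For example, pytesseract exposes Tesseract's word-level confidences, which can be averaged into a rough readability score when no ground truth is available (the averaging scheme here is just an illustrative choice):

    import pytesseract
    from pytesseract import Output

    def mean_confidence(image):
        # Average word-level confidence reported by Tesseract (0-100);
        # entries with conf == -1 are layout elements, not recognised words.
        data = pytesseract.image_to_data(image, output_type=Output.DICT)
        confs = [float(c) for c in data["conf"] if float(c) >= 0]
        return sum(confs) / len(confs) if confs else 0.0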


You might also extract some global image characteristics such as contrast, signal-to-noise ratio, sharpness, character size and density... and optimize the readability as above. Then feed this info to a classifier that will learn how to handle the different image conditions. Honestly, I don't really believe in this approach.
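For completeness, such descriptors are cheap to compute (illustrative formulas only, assuming OpenCV/NumPy):

    import cv2
    import numpy as np

    def global_features(gray):
        # A few coarse page statistics that could feed such a classifier.
        contrast = float(gray.std())                               # RMS contrast
        sharpness = float(cv2.Laplacian(gray, cv2.CV_64F).var())   # focus measure
        ink_ratio = float(np.mean(gray < 128))                     # dark-pixel coverage
        return [contrast, sharpness, ink_ratio]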
