
Extract text and labels from a PDF document

Posted 2020-03-07 08:19:41. Tags: python, opencv, image-processing, ocr, text-recognition

I am trying to use Python to detect and extract the "labels" and "dimensions" of a 2D technical drawing that is saved as a PDF. I came across a Python library called "pytesseract", which has optical character recognition (OCR) capability. I tried the demo on my image, but it fails to detect most of the labels and dimensions. Please suggest whether there is another way to do this. Thank you.

Attached is a sample of the 2D technical drawing I am trying to process.

What I am trying to achieve is to obtain the coordinates of every dimension on the image (the 160, 120, 10, 4x45, etc.) and to extract their values as well.

1 Answer

About 16 months ago we asked ourselves the same question. If you want to implement it yourself, I'd suggest the following process:

Extract the Canvas from the sheet
Separate the Cuts
Detect the Measure Regions on each Cut
Detect the individual attributes of the Measure Regions to understand where the Measures start and end. In your particular example that's relatively easy.
Run the detected Measure Labels through OCR
Associate the Labels with the Measures
Verify your results
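
To make the OCR and association steps above more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not a definitive implementation: the function names (`words_from_ocr`, `associate`) and the thresholds (`min_conf`, `max_dist`) are hypothetical, the OCR dictionary is assumed to have the layout that `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)` returns, and the measure end points are assumed to come from an earlier detection step that is not shown. The matching rule (nearest line midpoint, greedy) is one simple choice among several.

```python
import math
import re


def words_from_ocr(data, min_conf=60.0):
    """Keep confident words that contain a digit.

    `data` is assumed to be a dict of parallel lists with the keys
    text, conf, left, top, width and height.
    """
    words = []
    for i, text in enumerate(data["text"]):
        text = text.strip()
        # Skip empty entries and low-confidence detections.
        if not text or float(data["conf"][i]) < min_conf:
            continue
        # Dimension labels such as "160" or "4x45" contain at least one digit.
        if not re.search(r"\d", text):
            continue
        words.append((text, data["left"][i], data["top"][i],
                      data["width"][i], data["height"][i]))
    return words


def center(word):
    """Center point of a word's bounding box."""
    _, left, top, width, height = word
    return (left + width / 2.0, top + height / 2.0)


def midpoint(measure):
    """Midpoint of a measure line given as ((x1, y1), (x2, y2))."""
    (x1, y1), (x2, y2) = measure
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def associate(words, measures, max_dist=50.0):
    """Greedily pair each label with the closest unused measure line."""
    candidates = []
    for wi, word in enumerate(words):
        cx, cy = center(word)
        for mi, measure in enumerate(measures):
            mx, my = midpoint(measure)
            dist = math.hypot(cx - mx, cy - my)
            if dist <= max_dist:
                candidates.append((dist, wi, mi))
    candidates.sort()  # closest pairs first

    used_words, used_measures, pairs = set(), set(), []
    for _, wi, mi in candidates:
        if wi in used_words or mi in used_measures:
            continue
        used_words.add(wi)
        used_measures.add(mi)
        pairs.append((words[wi][0], measures[mi]))
    return pairs


if __name__ == "__main__":
    # Made-up OCR output and measure lines, for illustration only.
    ocr = {
        "text": ["160", "", "120", "A-A"],
        "conf": [91, -1, 88, 40],
        "left": [290, 0, 40, 500],
        "top": [20, 0, 190, 500],
        "width": [40, 0, 40, 30],
        "height": [16, 0, 16, 16],
    }
    measures = [((150, 40), (470, 40)), ((70, 80), (70, 320))]
    for text, line in associate(words_from_ocr(ocr), measures):
        print(text, line)
```

Labels that end up without a measure, or measures without a label, are natural candidates for the final verification step.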