Extract text and labels from a PDF document

I am trying to detect and extract the "labels" and "dimensions" of a 2D technical drawing saved as a PDF, using Python. I came across a Python library called "pytesseract", which has optical character recognition (OCR) capability. I tried the demo on my image, but it fails to detect most of the labels and dimensions. Please suggest whether there is another way to do it. Thank you.

Attached is a sample of the 2D technical drawing I am trying to process:

[Image: 2D technical drawing]

What I am trying to achieve is to obtain the coordinates of every dimension on the image (the 160, 120, 10, 4x45, etc.) and to extract them as well.

About 16 months ago we asked ourselves the same question. If you want to implement it yourself, I'd suggest the following process:

  1. Extract the Canvas from the sheet
  2. Separate the Cuts
  3. Detect the Measure Regions on each Cut
  4. Detect the individual attributes of the Measure Regions to understand where each Measure starts and ends. In your particular example that's relatively easy.
  5. Run the detected Measure Labels through OCR
  6. Associate the Labels with the Measures
  7. Verify your results
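
As a rough illustration of steps 5 to 7 only, here is a minimal Python sketch. It is not the pipeline described above: it assumes the measure lines from steps 1 to 4 have already been found by some other means, and `Label`, `Measure`, `associate` and `looks_like_dimension` are hypothetical names made up for this example. The OCR call uses pytesseract's `image_to_data`, which returns word bounding boxes and therefore the coordinates the question asks for; the nearest-midpoint matching and the regular expression are simplifying assumptions that will need tuning for real drawings.

```python
import math
import re
from dataclasses import dataclass


@dataclass
class Label:
    text: str
    x: float  # centre of the OCR bounding box, in pixels
    y: float


@dataclass
class Measure:  # hypothetical output of steps 1-4
    name: str
    start: tuple  # (x, y) of one end of the dimension line
    end: tuple    # (x, y) of the other end


def ocr_labels(image, min_conf=60.0):
    """Step 5: OCR an image (or a cropped measure region) into Labels."""
    import pytesseract  # imported lazily; needs the Tesseract binary installed

    # --psm 11 is Tesseract's "sparse text" mode, an assumption that tends
    # to suit scattered dimension labels better than the default layout.
    data = pytesseract.image_to_data(
        image, config="--psm 11", output_type=pytesseract.Output.DICT
    )
    return labels_from_ocr(data, min_conf)


def labels_from_ocr(data, min_conf=60.0):
    """Turn image_to_data's dict of parallel lists into Label objects."""
    labels = []
    for i, text in enumerate(data["text"]):
        text = text.strip()
        if not text or float(data["conf"][i]) < min_conf:
            continue  # skip empty rows and low-confidence guesses
        cx = data["left"][i] + data["width"][i] / 2
        cy = data["top"][i] + data["height"][i] / 2
        labels.append(Label(text, cx, cy))
    return labels


def associate(labels, measures):
    """Step 6: pair each label with the measure whose midpoint is nearest."""
    pairs = []
    for label in labels:
        def distance(m, label=label):
            mx = (m.start[0] + m.end[0]) / 2
            my = (m.start[1] + m.end[1]) / 2
            return math.hypot(label.x - mx, label.y - my)
        pairs.append((label, min(measures, key=distance)))
    return pairs


# Step 7: accept a plain number ("160") or a "4x45"-style label.
DIMENSION = re.compile(r"^\d+(\.\d+)?([xX×]\d+(\.\d+)?°?)?$")


def looks_like_dimension(text):
    return DIMENSION.match(text) is not None
```

With a rasterized page loaded as a PIL image, `ocr_labels(image)` would give the label positions, and `associate(labels, measures)` pairs them with whatever measure lines you detected. Labels that fail `looks_like_dimension` are good candidates for manual review.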

Alternatively, you can run it through our API and get the results as JSON.

Here's a quick visualization of the result: Drawing Read (GT stands for General Tolerances)
