
Extract text and labels from a PDF document

I am trying to use Python to detect and extract the "labels" and "dimensions" of a 2D technical drawing that is saved as a PDF. I came across a Python library called "pytesseract", which has optical character recognition capability. I tried the demo on my image, but it failed to detect most of the labels and dimensions. Please suggest whether there is another way to do it. Thank you.

Attached is a sample of the 2D technical drawing I am trying to process:

2D technical drawing

What I am trying to achieve is to obtain the coordinates of every dimension (the 160, 120, 10, 4x45, etc.) on the image, and to extract them as well.

About 16 months ago we asked ourselves the same question. If you want to implement it yourself, I'd suggest the following process:

  1. Extract the Canvas from the sheet
  2. Separate the Cuts
  3. Detect the Measure Regions on each Cut
  4. Detect the individual attributes of the Measure Regions to understand where each Measure starts and ends. In your particular example that's relatively easy.
  5. Run the detected Measure Labels through OCR
  6. Associate the Labels to the Measures
  7. Verify your results
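
Steps 5 and 6 can be sketched roughly as follows. This is a minimal illustration, not our actual implementation. It assumes the drawing has already been rasterized to an image and that the measure lines from steps 3 and 4 are available as `(x1, y1, x2, y2)` pixel segments. The helper names (`ocr_label_boxes`, `filter_boxes`, `associate_labels`), the `--psm 11` sparse-text mode, and the confidence and distance thresholds are all assumptions chosen for illustration. `pytesseract.image_to_data` returns each word together with its bounding box, which gives the coordinates asked for in the question.

```python
import math


def ocr_label_boxes(image, min_conf=60.0):
    """Step 5: OCR an image (e.g. a PIL image) and return words with pixel boxes.

    Needs the third-party pytesseract package plus a Tesseract install; the
    import is lazy so the pure-Python helpers below work without them.
    """
    import pytesseract
    from pytesseract import Output

    # "--psm 11" (sparse text) is an assumption; other modes may work better.
    data = pytesseract.image_to_data(
        image, config="--psm 11", output_type=Output.DICT
    )
    return filter_boxes(data, min_conf)


def filter_boxes(data, min_conf=60.0):
    """Keep non-empty words whose OCR confidence is at least min_conf."""
    boxes = []
    for i, text in enumerate(data["text"]):
        if text.strip() and float(data["conf"][i]) >= min_conf:
            boxes.append({
                "text": text.strip(),
                "left": data["left"][i],
                "top": data["top"][i],
                "width": data["width"][i],
                "height": data["height"][i],
            })
    return boxes


def point_to_segment_distance(px, py, x1, y1, x2, y2):
    """Shortest distance from point (px, py) to the segment (x1, y1)-(x2, y2)."""
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        return math.hypot(px - x1, py - y1)
    # Project the point onto the segment and clamp to its end points.
    t = ((px - x1) * dx + (py - y1) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (x1 + t * dx), py - (y1 + t * dy))


def associate_labels(boxes, measures, max_dist=50.0):
    """Step 6: pair each label with its nearest measure line, if close enough."""
    pairs = []
    for box in boxes:
        cx = box["left"] + box["width"] / 2
        cy = box["top"] + box["height"] / 2
        best = min(
            measures,
            key=lambda m: point_to_segment_distance(cx, cy, *m),
            default=None,
        )
        if best is not None and point_to_segment_distance(cx, cy, *best) <= max_dist:
            pairs.append((box["text"], best))
    return pairs
```

Labels that run vertically along a dimension line are often missed when the image is OCR'd upright, so it may help to run the OCR a second time on a copy of the image rotated by 90 degrees and map the boxes back.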

Alternatively, you can run it through our API and get the results as JSON.

Here's a quick visualization of the result: Drawing Read (GT stands for General Tolerances)


 