
extract text and labels from PDF document

Original 2020-03-07 08:19:41 6 1 python/ opencv/ image-processing/ ocr/ text-recognition

I am trying to detect and extract the "labels" and "dimensions" of a 2D technical drawing that is saved as a PDF, using Python. I came across a Python library called "pytesseract", which has optical character recognition (OCR) capability. I tried the demo on my image, but it fails to detect most of the labels/dimensions. Please suggest whether there is another way to do it. Thank you.

Attached is a sample of the 2D technical drawing I am trying to process.

What I am trying to achieve is to obtain the coordinates of every dimension (the 160, 120, 10, 4x45, etc.) on the image, and to extract them as well.
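For reference, a minimal sketch of how word-level coordinates could be pulled out of pytesseract, assuming the PDF page is first rasterized with the separate pdf2image package (which needs Poppler installed) and that Tesseract itself is available. The helper names `boxes_from_ocr_data` and `ocr_pdf_page`, the 300 dpi setting, the `--psm 11` (sparse text) mode and the confidence threshold of 60 are illustrative assumptions, not tuned values.

```python
from typing import Dict, List, Tuple

# (text, left, top, width, height) in pixels of the rasterized page
Box = Tuple[str, int, int, int, int]


def boxes_from_ocr_data(data: Dict[str, list], min_conf: float = 60.0) -> List[Box]:
    """Keep non-empty words at or above a confidence threshold, with their boxes.

    `data` is the dict returned by pytesseract.image_to_data(...,
    output_type=pytesseract.Output.DICT).
    """
    boxes = []
    for i, text in enumerate(data["text"]):
        text = text.strip()
        # Tesseract reports conf = -1 for layout rows that hold no word;
        # conf may arrive as str, int or float depending on the version.
        if text and float(data["conf"][i]) >= min_conf:
            boxes.append((text, data["left"][i], data["top"][i],
                          data["width"][i], data["height"][i]))
    return boxes


def ocr_pdf_page(pdf_path: str, dpi: int = 300) -> List[Box]:
    """Rasterize the first page of a PDF and OCR it (hypothetical helper)."""
    # Imported lazily so the pure helper above works without these packages.
    from pdf2image import convert_from_path
    import pytesseract

    page = convert_from_path(pdf_path, dpi=dpi)[0]  # PIL image of page 1
    data = pytesseract.image_to_data(
        page,
        config="--psm 11",  # sparse-text mode; an assumption for drawings
        output_type=pytesseract.Output.DICT,
    )
    return boxes_from_ocr_data(data)
```

Calling `ocr_pdf_page("drawing.pdf")` (a placeholder file name) would return a list of `(text, left, top, width, height)` tuples. Rotated dimension text, which is common in drawings, would likely need the page or the cropped region to be rotated before OCR.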

1 answer

About 16 months ago we asked ourselves the same question. If you want to implement it yourself, I'd suggest the following process:

1. Extract the Canvas from the sheet
2. Separate the Cuts
3. Detect the Measure Regions on each Cut
4. Detect the individual attributes of the Measure Regions to understand where each Measure starts and ends. In your particular example that's relatively easy.
5. Run the detected Measure Labels through OCR
6. Associate the Labels to the Measures
7. Verify your results
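As an illustration of the association step only, here is a minimal sketch that assigns each OCR'd label to the nearest measure line. It assumes the earlier steps have already produced measure lines as pixel-coordinate segments and label centers as points; the names `point_segment_distance` and `associate_labels` and the 40-pixel cutoff are hypothetical, and this is not necessarily how the pipeline described above does it.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]  # start and end of a measure line


def point_segment_distance(p: Point, seg: Segment) -> float:
    """Shortest distance from point p to the line segment seg."""
    (x1, y1), (x2, y2) = seg
    dx, dy = x2 - x1, y2 - y1
    length_sq = dx * dx + dy * dy
    if length_sq == 0:  # degenerate segment: treat it as a single point
        return math.hypot(p[0] - x1, p[1] - y1)
    # Project p onto the line, then clamp the projection to the segment.
    t = ((p[0] - x1) * dx + (p[1] - y1) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (x1 + t * dx), p[1] - (y1 + t * dy))


def associate_labels(
    labels: List[Tuple[str, Point]],
    measures: List[Segment],
    max_dist: float = 40.0,  # assumed cutoff in pixels
) -> List[Tuple[str, Optional[int]]]:
    """Pair each label with the index of its nearest measure, or None."""
    result = []
    for text, center in labels:
        best_idx, best_dist = None, max_dist
        for idx, seg in enumerate(measures):
            dist = point_segment_distance(center, seg)
            if dist <= best_dist:
                best_idx, best_dist = idx, dist
        result.append((text, best_idx))
    return result


measures = [((0, 0), (160, 0)), ((200, 0), (200, 120))]
labels = [("160", (80, -12)), ("120", (212, 60)), ("GT", (500, 500))]
print(associate_labels(labels, measures))
```

Labels that end up with `None` (such as a general-tolerance note far from any measure line) are candidates for the verification step, since they either are not dimensions or belong to a measure that was missed.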

Alternatively, you can run it through our API and get the results as JSON.

Here's a quick visualization of the result: Drawing Read (GT stands for General Tolerances)

Extract text from PDF

Extract text from a PDF with regex

Extract Text from MediaBox - PDF

Extract text from pdf to file

Extract underlined text from pdf

Python code to extract txt from PDF document

Extract embedded pdf document from a webpage

Extract only bold text from PDF documents

extract text from pdf file object in python

How extract text from this compressed PDF/A?


The technical posts on this site follow the CC BY-SA 4.0 license. If you reprint them, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.


solution1 0 2020-05-08 22:16:44
