Navigate through a pdf file to find specific pages and extract tabular data from image with python

Original 2021-12-15 09:55:34 7 1 python/ python-3.x/ dataframe/ tabular/ image-extraction

I've come across an assignment that requires me to extract tabular data from images in a PDF file into neatly formatted dataframes using Python. There are several files to process, and the relevant pages may have different page numbers in each file, so I assume the sequence of steps for this problem is:

1. Navigate to relevant section of the pdf
2. Extract images of the tabular data
3. Extract data from the images, format and convert to dataframes.
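As a rough illustration of the first two steps, here is a minimal sketch. It assumes the pages carry a text layer (for example a heading or caption near each table) that a regular expression can match, and it assumes the third-party `pypdf` package for reading the file. The function names, the sample page texts and the pattern are made up for the example; if the pages are pure scans with no text layer, the page search would need OCR first.

```python
import re
from typing import Iterable, List


def find_pages(page_texts: Iterable[str], pattern: str) -> List[int]:
    """Return the 0-based indices of pages whose text matches `pattern`."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [i for i, text in enumerate(page_texts) if regex.search(text or "")]


def read_page_texts(pdf_path: str) -> List[str]:
    """Extract the text layer of every page (assumes pypdf is installed)."""
    from pypdf import PdfReader  # third-party, imported only when needed

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def save_page_images(pdf_path: str, page_index: int, prefix: str) -> List[str]:
    """Write the images embedded in one page to disk and return the file names."""
    from pypdf import PdfReader  # third-party, imported only when needed

    page = PdfReader(pdf_path).pages[page_index]
    written = []
    for image in page.images:
        out_name = f"{prefix}_{page_index}_{image.name}"
        with open(out_name, "wb") as fh:
            fh.write(image.data)
        written.append(out_name)
    return written


# Hypothetical page texts standing in for read_page_texts("some_report.pdf")
sample_pages = ["Contents", "Table 3: Quarterly results", "Notes"]
matches = find_pages(sample_pages, r"table\s+\d+")
```

With the sample above, `matches` holds the index of the one page that mentions a table. For a real file, the indices from `find_pages(read_page_texts(path), pattern)` could be passed to `save_page_images` to get the images for the third step.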

A few Google searches turned up libraries for PDF text extraction, table extraction and more, but each of them is a modular solution that covers only part of the problem.

I would appreciate some help in this regard. What packages should I use? Is my approach correct? Can I get references to any helpful code snippets for similar problems?

[Image: page structure of the required tables]

1 answer

This started as a comment. I believe the answer is valid as it is in no way an endorsement of the service. I don't even use it. I know Azure uses SO as well.

This is the stuff of commercial services. You can try Azure Form Recognizer (with which I am not affiliated):

https://docs.microsoft.com/en-us/azure/applied-ai-services/form-recognizer

Here are some python examples of how to use it:

https://docs.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/how-to-guides/try-sdk-rest-api?pivots=programming-language-python

The AWS equivalent is Textract https://aws.amazon.com/textract

The Google Cloud version is called Form Parser - see https://cloud.google.com/document-ai/docs/processors-list#processor_form-parser
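To give an idea of what using one of these services looks like, here is a minimal sketch for Azure Form Recognizer. It assumes the `azure-ai-formrecognizer` package (version 3.2 or later) and its `prebuilt-layout` model; `endpoint` and `key` are placeholders for your own resource. The `table_to_rows` helper is my own illustration, not part of the SDK, and relies only on each table exposing `row_count`, `column_count` and cells with `row_index`, `column_index` and `content`.

```python
from types import SimpleNamespace
from typing import Any, List


def table_to_rows(table: Any) -> List[List[str]]:
    """Arrange a table's cells into a row-major grid of strings."""
    grid = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
    for cell in table.cells:
        grid[cell.row_index][cell.column_index] = cell.content
    return grid


def analyze_tables(path: str, endpoint: str, key: str) -> List[List[List[str]]]:
    """Send a PDF or image to the layout model and return every table as rows."""
    # Third-party imports, kept inside the function so the helper above
    # can be used without the Azure SDK installed.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))
    with open(path, "rb") as fh:
        poller = client.begin_analyze_document("prebuilt-layout", document=fh)
        result = poller.result()
    return [table_to_rows(t) for t in (result.tables or [])]


# Stand-in object shaped like one recognized table, for a local dry run
sample = SimpleNamespace(
    row_count=2,
    column_count=2,
    cells=[
        SimpleNamespace(row_index=0, column_index=0, content="Item"),
        SimpleNamespace(row_index=0, column_index=1, content="Qty"),
        SimpleNamespace(row_index=1, column_index=0, content="Bolt"),
        SimpleNamespace(row_index=1, column_index=1, content="40"),
    ],
)
rows = table_to_rows(sample)
```

If the first row is a header, `pandas.DataFrame(rows[1:], columns=rows[0])` turns the result into a dataframe. Merged cells and tables that continue across pages would need extra handling that this sketch does not attempt.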

how to extract tabular data from pdf properly when a row data is divided in two separate pages?

Extract specific pages of PDF and save it with Python

How to extract data from image that contains tabular data?

Extract an image from a PDF in python

extract specific data from txt file in python

Extract Specific Data from Txt file python

How to extract only specific text from PDF file using python

Extract specific Data values from Invoices PDF using PDFminer : Python

Extract specific data from .pdf and save in Excel file

Extract column from tabular data

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact yoyou2525@163.com.


solution1 0 2021-12-15 09:58:32
