Camelot Cannot extract entire table

Original 2021-06-26 14:58:16 7 1 python/ pdf-extraction/ python-camelot/ pdftables/ ocrmypdf

I'm using Camelot to extract table information from a PDF that I have converted from scanned to searchable using ocrmypdf (500 dpi).

Camelot identifies the table and extracts most of its data, but it fails on the bottom half: it reads the top half of the table correctly, yet it cannot separate the text in the lower half into individual cells.

This is the table from the PDF in question:

But when I use Camelot's visual debugging to show the words it will extract, it appears to treat the bottom section of the table as one giant block.

Any guidance you can provide on improving Camelot's "vision" here would be helpful.

1 answer

Apart from the block, the horizontal lines are also marked as text, which is odd.

Camelot uses pdfminer.six for text extraction, and you can pass LAParams (page 16) to camelot.read_pdf() through its layout_kwargs argument to tweak that.
You should also check out camelot.plot(table, kind="grid") to see whether the lines are recognized correctly. If they are not, that might be where the problem lies.
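A minimal sketch of both suggestions, under stated assumptions: the file name `scanned_ocr.pdf`, the helper names `make_layout_kwargs` and `read_and_plot`, and the specific margin values are hypothetical and only illustrate the idea. `char_margin`, `line_margin`, and `word_margin` are real pdfminer.six LAParams fields; lowering them can make pdfminer less eager to merge nearby text into a single box, but whether that helps with this particular OCR output is untested. In the Camelot releases I am aware of, the plotting keyword is `kind`.

```python
def make_layout_kwargs(char_margin=1.0, line_margin=0.3, word_margin=0.1):
    """Build LAParams overrides as a plain dict (values are illustrative guesses)."""
    return {
        "char_margin": char_margin,   # pdfminer default is 2.0
        "line_margin": line_margin,   # pdfminer default is 0.5
        "word_margin": word_margin,   # pdfminer default is 0.1
    }


def read_and_plot(pdf_path, pages="1", layout_kwargs=None):
    """Read tables with custom LAParams and show Camelot's debug plots."""
    # Imported lazily so the helper above works without Camelot installed.
    import camelot
    import matplotlib.pyplot as plt

    tables = camelot.read_pdf(
        pdf_path,
        pages=pages,
        flavor="lattice",                    # line-based table detection
        layout_kwargs=layout_kwargs or {},   # forwarded to pdfminer's LAParams
    )
    if len(tables) == 0:
        raise ValueError("Camelot found no tables on the requested pages")

    # "text" shows the text boxes pdfminer produced; "grid" shows detected cells.
    for kind in ("text", "grid"):
        camelot.plot(tables[0], kind=kind)
    plt.show()
    return tables


if __name__ == "__main__":
    # Hypothetical file name; replace with the OCR'd PDF.
    read_and_plot("scanned_ocr.pdf", layout_kwargs=make_layout_kwargs())
```

If the "grid" plot already shows the lower cells correctly while the "text" plot still shows one large box, the problem is more likely in the text layout step (or in the OCR text layer) than in line detection.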



solution1 0 2021-10-26 10:20:00
