
Camelot cannot extract entire table

I'm using Camelot to extract table information from a PDF that I converted from scanned to searchable using ocrmypdf (500 dpi).

Camelot identifies the table and extracts most of its data, but it fails to extract the bottom half. In essence, it reads the top half of the table correctly but cannot separate the text in the lower half.

This is the table from the PDF in question:

[Image: PDF table]

But when I use Camelot's visual debugging method and ask it to show me the words it will extract, it seems to recognize the bottom section of the table as one giant block.

[Image: table visual debugging]

Any guidance you can provide on improving Camelot's "vision" here would be helpful.

Apart from the block, the horizontal lines are also marked as text, which is odd.

Camelot uses pdfminer.six for text extraction, and you can pass LAParams (page 16) to camelot.read_pdf() through its layout_kwargs argument to tweak that.
You should also check out camelot.plot(table, kind="grid") to see if the lines are recognized correctly (the keyword is kind, not type, in current Camelot releases). If not, that might be where the problem lies.
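
A minimal sketch of both suggestions, in case it helps. The helper names, the file name scanned.pdf, and the specific margin values are assumptions for illustration only. The calls that come from Camelot itself are camelot.read_pdf() with layout_kwargs and camelot.plot() with kind. Lowering char_margin and line_margin below the pdfminer.six defaults (2.0 and 0.5) makes pdfminer less eager to merge nearby characters and lines into one text box, which may or may not break up the giant block in an OCR'd file.

```python
def tighter_layout_kwargs(char_margin=1.0, line_margin=0.2, word_margin=0.1):
    """Build a dict of pdfminer LAParams keyword arguments.

    The default values here are illustrative guesses, not recommendations.
    pdfminer.six itself defaults to char_margin=2.0, line_margin=0.5,
    word_margin=0.1.
    """
    return {
        "char_margin": char_margin,
        "line_margin": line_margin,
        "word_margin": word_margin,
    }


def read_tables(pdf_path, pages="1", layout_kwargs=None):
    """Run Camelot on pdf_path, forwarding LAParams via layout_kwargs."""
    import camelot  # imported lazily so the helpers above work without it

    return camelot.read_pdf(
        pdf_path,
        pages=pages,
        flavor="lattice",  # assumption: the table has ruling lines
        layout_kwargs=layout_kwargs or {},
    )


def show_debug_plots(table):
    """Plot the text boxes and the detected grid for one extracted table."""
    import camelot
    import matplotlib.pyplot as plt

    camelot.plot(table, kind="text")  # what pdfminer found as text
    camelot.plot(table, kind="grid")  # the cell grid Camelot built from the lines
    plt.show()


# Example usage (hypothetical file name):
#   tables = read_tables("scanned.pdf", layout_kwargs=tighter_layout_kwargs())
#   show_debug_plots(tables[0])
#   print(tables[0].df)
```

If the grid plot shows missing lines in the lower half, the line detection is the more likely culprit and tuning LAParams alone probably will not help.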
