
No tables found and merged column text when extracting data from this PDF using Camelot

I get a UserWarning: No tables found on page-1 when I try to extract tables from the attached PDF. However, when I looked at the extracted data, some of the column text was merged into a single column.


I am using Camelot to parse these PDFs.

Steps to reproduce: camelot --output m27.csv --format csv stream m27.pdf

Here is a link to the PDF that I am trying to parse: https://github.com/tabulapdf/tabula-java/blob/master/src/test/resources/technology/tabula/m27.pdf

A PDF just contains instructions to place a character at an x,y coordinate on a 2-D plane; it retains no knowledge of words, sentences or tables.

Camelot uses PDFMiner under the hood to group characters into words and words into sentences. Sometimes, when the characters are too close together, PDFMiner groups characters belonging to different words into a single word.
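To make the merging effect concrete, here is a toy sketch of gap-based grouping. It is not PDFMiner's actual algorithm; the function group_chars, the max_gap threshold and the coordinates are all made up for illustration. It only shows why two cells end up as one word once the gap between columns falls under the grouping threshold.

```python
def group_chars(chars, max_gap):
    """Group (x0, x1, text) character boxes into words.

    Toy rule (an assumption, not PDFMiner's real logic): a character joins
    the current word when the horizontal gap to the previous character is
    at most max_gap.
    """
    words = []
    current = ""
    prev_x1 = None
    for x0, x1, text in sorted(chars):
        # Start a new word only when the gap exceeds the threshold.
        if prev_x1 is not None and x0 - prev_x1 > max_gap:
            words.append(current)
            current = ""
        current += text
        prev_x1 = x1
    if current:
        words.append(current)
    return words


# Hypothetical row: cells "12" and "34" sit in columns only 1.5 units apart.
row = [(10.0, 15.0, "1"), (15.5, 20.5, "2"), (22.0, 27.0, "3"), (27.5, 32.5, "4")]

print(group_chars(row, max_gap=1.0))  # ['12', '34']
print(group_chars(row, max_gap=2.0))  # ['1234']
```

With the looser threshold the column boundary disappears from the text, which is the situation the stream parser then has to cope with.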

Since the characters in your PDF table are placed very close together, they are merged into a single word, so Camelot can't detect the columns correctly. In this case you can specify column separators to get the table out. To get the x-coordinates of the column separators, check out the visual debugging guide. Additionally, you can specify split_text=True to cut words along the column separators you've specified. Here's the code (I got the x-coordinates by creating a matplotlib plot of the text in the PDF using $ camelot stream -plot text m27.pdf):

Using CLI:

$ camelot --output m27.csv --format csv -split stream -C 72,95,209,327,442,529,566,606,683 m27.pdf

Using API:

>>> import camelot
>>> tables = camelot.read_pdf('m27.pdf', flavor='stream', columns=['72,95,209,327,442,529,566,606,683'], split_text=True)
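If you want the API call as a script that also writes the CSV, something like the following sketch may help. The helper names columns_arg and extract_table are hypothetical, and it assumes there is a single table area on the page, so a single separator string is passed. The Camelot import is kept inside the function so that the formatting helper can be used on its own.

```python
def columns_arg(separators):
    """Format x-coordinates the way `columns` expects them:
    a list holding one comma-separated string per table area."""
    return [",".join(str(x) for x in separators)]


def extract_table(pdf_path, separators, csv_path):
    # Imported here so columns_arg works without Camelot installed.
    import camelot

    tables = camelot.read_pdf(
        pdf_path,
        flavor="stream",
        columns=columns_arg(separators),
        split_text=True,  # cut merged words along the separators
    )
    if len(tables) == 0:
        raise ValueError(f"no tables found in {pdf_path}")
    table = tables[0]
    print(table.parsing_report)  # accuracy, whitespace, order, page
    table.to_csv(csv_path)
    return table.df


# x-coordinates read off the text plot, as in the commands above.
SEPARATORS = [72, 95, 209, 327, 442, 529, 566, 606, 683]
print(columns_arg(SEPARATORS))  # ['72,95,209,327,442,529,566,606,683']
```

Calling extract_table("m27.pdf", SEPARATORS, "m27.csv") should then give roughly the same result as the CLI command, returning the table as a pandas DataFrame.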

