How to extract tables from a pdf with PDFMiner?

I am trying to extract information from some tables in a PDF document.
Consider the input:

Title 1
some text some text some text some text some text
some text some text some text some text some text

Table Title
| Col1          | Col2    | Col3    |
|---------------|---------|---------|
| val11         | val12   | val13   |
| val21         | val22   | val23   |
| val31         | val32   | val33   |

Title 2
some more text some more text some more text some more text
some more text
some more text some more text some more text some more text

I can get the outline/titles like this:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

path = 'myFile.pdf'
# Open a PDF file.
fp = open(path, 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization.
document = PDFDocument(parser, '')
# Each outline entry is a (level, title, dest, a, se) tuple.
outlines = document.get_outlines()
for (level, title, dest, a, se) in outlines:
    print((level, title))

This gives me:

(1, u'Title 1')
(2, u'Table Title')
(1, u'Title 2')

This is perfect, as the levels are consistent with the text hierarchy. Now I can extract the text as follows:

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed

if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object that performs layout analysis.
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
with open('textFromPdf.txt', 'w') as text_from_pdf:
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
        layout = device.get_result()
        for element in layout:
            if isinstance(element, LTTextBox):
                # Replace non-ASCII characters with spaces before writing.
                text_from_pdf.write(''.join([i if ord(i) < 128 else ' '
                                             for i in element.get_text()]))

This gives me:

Title 1
some text some text some text some text some text some text some text
some text some text some text some text some text some text some text
Table Title
Col1
val11
val12
val13
Col2
val21
val22
val23
Col3
val31
val32
val33
Title 2
some more text some more text some more text some more text
some more text
some more text some more text some more text some more text

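This is a bit weird, because the table is extracted column by column. Can I get the table row by row? Also, how can I determine where a table starts and ends?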
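One possible way to recover rows from position data is sketched below. It is not a PDFMiner feature: `group_into_rows` and its `y_tolerance` parameter are hypothetical names introduced here, and the sketch assumes every cell is available as an `(x0, y0, text)` tuple in PDF coordinates, where y grows upward. Fragments whose y values fall within the tolerance are treated as one row, and each row is then ordered from left to right.

```python
def group_into_rows(fragments, y_tolerance=2.0):
    """Group (x0, y0, text) fragments into rows, top to bottom.

    Hypothetical helper: fragments whose y0 lies within y_tolerance of
    the first fragment of the current row are placed in that row.
    """
    rows = []  # each entry is [reference_y, [(x0, text), ...]]
    # Visit fragments from the top of the page down (largest y first).
    for x0, y0, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y0) <= y_tolerance:
            rows[-1][1].append((x0, text))
        else:
            rows.append([y0, [(x0, text)]])
    # Order the cells of each row from left to right, keeping only the text.
    return [
        [text for _, text in sorted(cells, key=lambda c: c[0])]
        for _, cells in rows
    ]


# Made-up coordinates, listed column by column as in the extracted output.
fragments = [
    (50, 700, "Col1"), (50, 680, "val11"), (50, 660, "val21"),
    (150, 700, "Col2"), (150, 680, "val12"), (150, 660, "val22"),
]
rows = group_into_rows(fragments)
for row in rows:
    print(row)
```

With PDFMiner, the tuples could be built from each LTTextLine found inside an LTTextBox, for example `(line.x0, line.y0, line.get_text().strip())`. The tolerance would need tuning per document, and this does not by itself tell you where a table starts or ends.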

If you only want to extract tables from a PDF document, take a look at this answer: How to extract tables as text from a PDF using Python?

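From that answer I tried tabula-py, which worked on numeric tables spread across a multi-page PDF. tabula-py correctly skipped all the headers and footers. Previously I had tried PDFMiner on the same type of document and ran into the same problems you mention, sometimes even worse ones.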
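A minimal sketch of the tabula-py route, assuming tabula-py and a Java runtime are installed. The wrapper name `read_tables_with_tabula` is hypothetical, and the file name is the one from the question. `tabula.read_pdf` with `multiple_tables=True` returns a list of pandas DataFrames, one per detected table.

```python
def read_tables_with_tabula(pdf_path, pages="all"):
    """Return the tables tabula-py detects in pdf_path as DataFrames.

    Hypothetical wrapper around tabula.read_pdf; `pages` may be "all",
    a page number, or a range string such as "1-3".
    """
    # Imported lazily so this module still loads without tabula-py.
    import tabula

    return tabula.read_pdf(pdf_path, pages=pages, multiple_tables=True)


# Example call (needs tabula-py, Java, and a real file on disk):
# for table in read_tables_with_tabula("myFile.pdf"):
#     print(table)
```

Each DataFrame is organised by row, so the column-by-column ordering seen with PDFMiner's text boxes should not appear here, provided the table is detected correctly.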

Use camelot to extract the tables from the PDF.
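A minimal sketch of the camelot route, assuming the camelot-py package is installed. The wrapper name `read_tables_with_camelot` is hypothetical. The `lattice` flavor is meant for tables with ruled cell borders and `stream` for tables separated by whitespace; which one suits the document in the question is an assumption.

```python
def read_tables_with_camelot(pdf_path, pages="1", flavor="lattice"):
    """Return the tables camelot detects in pdf_path as DataFrames.

    Hypothetical wrapper around camelot.read_pdf; `pages` is a string
    such as "1", "1,3-5", or "all".
    """
    # Imported lazily so this module still loads without camelot-py.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
    # Each detected table exposes its contents as a DataFrame via .df.
    return [table.df for table in tables]


# Example call (needs camelot-py and a real file on disk):
# for df in read_tables_with_camelot("myFile.pdf", pages="all"):
#     print(df)
```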
