
How to extract tables from a PDF with PDFMiner?

I am trying to extract information from some tables in a PDF document.
Consider the input:

Title 1
some text some text some text some text some text
some text some text some text some text some text

Table Title
| Col1          | Col2    | Col3    |
|---------------|---------|---------|
| val11         | val12   | val13   |
| val21         | val22   | val23   |
| val31         | val32   | val33   |

Title 2
some more text some more text some more text some more text
some more text
some more text some more text some more text some more text

I can get the outline/headings like this:

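from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

path = 'myFile.pdf'
# Open the PDF file in binary mode.
fp = open(path, 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization (empty here).
document = PDFDocument(parser, '')
# Each outline entry unpacks into five fields; only level and title are used.
outlines = document.get_outlines()
for (level, title, dest, a, se) in outlines:
    print(level, title)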

This gives me:

(1, u'Title 1')
(2, u'Table Title')
(1, u'Title 2')

This is perfect, because the levels match the hierarchy of the text. Now I can extract the text as follows:

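from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfdocument import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object that aggregates each page into a layout tree.
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
# The with-block makes sure the output file is flushed and closed.
with open('textFromPdf.txt', 'w') as text_from_pdf:
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
        # Layout of the page that was just processed.
        layout = device.get_result()
        for element in layout:
            if isinstance(element, LTTextBox):
                # Replace non-ASCII characters with spaces before writing.
                text_from_pdf.write(''.join([i if ord(i) < 128 else ' '
                                             for i in element.get_text()]))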

This gives me:

Title 1
some text some text some text some text some text some text some text
some text some text some text some text some text some text some text
Table Title
Col1
val11
val12
val13
Col2
val21
val22
val23
Col3
val31
val32
val33
Title 2
some more text some more text some more text some more text
some more text
some more text some more text some more text some more text

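This is a bit odd, because the table is extracted column by column. Can I get the table row by row? Also, how can I determine where the table starts and ends?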
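One possible way to approach both questions while staying with PDFMiner is to ignore the text-box grouping and regroup individual text lines by their vertical position. The sketch below is only an illustration under assumptions: `group_into_rows`, `rows_between`, `lines_from_layout` and the tolerance `y_tol` are hypothetical names, not part of PDFMiner, and the approach assumes that cells of one table row share roughly the same baseline and that the outline titles appear as single-cell rows in the text.

```python
def lines_from_layout(layout):
    """Yield (x0, y0, text) for every text line on a PDFMiner page layout.

    Hypothetical helper; pdfminer is imported lazily so the pure
    functions below can be used without it.
    """
    from pdfminer.layout import LTTextBox, LTTextLine
    for element in layout:
        if isinstance(element, LTTextBox):
            for line in element:
                if isinstance(line, LTTextLine):
                    yield line.x0, line.y0, line.get_text()


def group_into_rows(items, y_tol=3.0):
    """Group (x0, y0, text) tuples into rows, top to bottom, left to right.

    y_tol is an assumed tolerance (in PDF points) for treating two lines
    as belonging to the same row; it will need tuning per document.
    """
    rows = []  # each entry: [reference_y, [(x0, text), ...]]
    # PDF y coordinates grow upward, so sort descending to start at the top.
    for x0, y0, text in sorted(items, key=lambda item: -item[1]):
        if rows and abs(rows[-1][0] - y0) <= y_tol:
            rows[-1][1].append((x0, text.strip()))
        else:
            rows.append([y0, [(x0, text.strip())]])
    # Within each row, order the cells by their left edge.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


def rows_between(rows, start_title, end_title):
    """Return the rows strictly between two single-cell title rows."""
    selected, inside = [], False
    for row in rows:
        if start_title in row:
            inside = True
            continue
        if inside and end_title in row:
            break
        if inside:
            selected.append(row)
    return selected


# Made-up coordinates standing in for one page of the sample document.
sample = [
    (50, 740, "Title 1\n"), (50, 700, "Table Title\n"),
    (50, 680, "Col1\n"), (150, 680.4, "Col2\n"), (250, 679.8, "Col3\n"),
    (50, 660, "val11\n"), (150, 660, "val12\n"), (250, 660, "val13\n"),
    (50, 620, "Title 2\n"),
]
table_rows = rows_between(group_into_rows(sample), "Table Title", "Title 2")
```

With a real document, `group_into_rows(lines_from_layout(layout))` would replace the made-up `sample`, and the titles returned by `get_outlines()` could serve as the start and end markers. Any ordinary paragraph sitting between the table and the next title would also be returned, so this only approximates the table's extent.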

If you only want to extract tables from a PDF document, have a look at this answer: How to extract tables as text from a PDF using Python?

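Following that answer, I tried tabula-py, which worked well for numeric tables spread across a multi-page PDF. tabula-py correctly skipped all headers and footers. Before that, I had tried PDFMiner on the same kind of document and ran into the same problem you describe, and sometimes worse.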
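A minimal sketch of what the tabula-py route might look like, assuming the `tabula-py` package and a Java runtime are installed. The wrapper name `extract_tables_with_tabula` and the file name `myFile.pdf` are placeholders for illustration, and the quality of the result depends on the document.

```python
def extract_tables_with_tabula(path, pages="all"):
    """Return the tables tabula-py detects in the PDF as pandas DataFrames.

    Hypothetical wrapper; tabula is imported lazily because it needs
    the tabula-py package and Java to be available.
    """
    import tabula
    # With multiple_tables=True, read_pdf returns a list of DataFrames,
    # one per detected table, each laid out row by row.
    return tabula.read_pdf(path, pages=pages, multiple_tables=True)


if __name__ == "__main__":
    for number, frame in enumerate(extract_tables_with_tabula("myFile.pdf"), 1):
        print("Table", number)
        print(frame)
```

Each DataFrame can then be iterated row by row, for example with `frame.itertuples(index=False)`.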

Use camelot to extract the tables from the PDF.
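As a rough sketch of that suggestion, assuming the `camelot-py` package and its dependencies are installed: the wrapper name `extract_tables_with_camelot`, the file name and the page selection are placeholders, and the choice of `flavor` is a guess that depends on whether the table has ruled lines.

```python
def extract_tables_with_camelot(path, pages="1", flavor="lattice"):
    """Return the tables camelot detects as a list of pandas DataFrames.

    Hypothetical wrapper; camelot is imported lazily because it is an
    optional third-party dependency.
    """
    import camelot
    # "lattice" relies on ruling lines between cells; "stream" uses
    # whitespace between columns instead.
    tables = camelot.read_pdf(path, pages=pages, flavor=flavor)
    return [table.df for table in tables]


if __name__ == "__main__":
    for frame in extract_tables_with_camelot("myFile.pdf"):
        print(frame)
```

If nothing is detected with `flavor="lattice"`, trying `flavor="stream"` may help for tables drawn without borders.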
