How do I extract multiple tables from a PDF file using tabula in Python?

Question

If a PDF file contains only one table, it can be extracted simply with this code:

from tabula import read_pdf
df = read_pdf(r"C:\Users\Himanshu Poddar\Desktop\pdf_file.pdf")

But if the PDF file contains more than one table, I am unable to extract them all, because only the first one is extracted.

Answer 1

Hope the code below helps. I haven't tested it with large tables yet. Let me know if there is any scenario where this code could fail. I'm new to Python, so feedback will help me improve my knowledge :)

import os
# Older tabula-py releases exposed this as tabula.wrapper.read_pdf
from tabula import read_pdf

os.chdir("E:/Documents/myPy/")
# lattice=True replaces the deprecated spreadsheet=True option
tables = read_pdf("MyPDF.pdf", multiple_tables=True, pages='all', encoding='utf-8', lattice=True)

for i, table in enumerate(tables, start=1):
    # Promote the first row to the header, then drop it from the data
    table.columns = table.iloc[0]
    table = table.reindex(table.index.drop(0)).reset_index(drop=True)
    table.columns.name = None
    # To write Excel
    table.to_excel('output' + str(i) + '.xlsx', header=True, index=False)
    # To write CSV
    table.to_csv('output' + str(i) + '.csv', sep='|', header=True, index=False)

Answer 2

Even when using the tabula-py wrapper, you can use all the same options that are listed in the Tabula Java docs.

In your case you can simply add pages="all":

from tabula import read_pdf
df = read_pdf(r"C:\Users\Himanshu Poddar\Desktop\pdf_file.pdf", pages ="all")
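
A slightly fuller sketch of the same idea, in case it helps. Depending on the tabula-py version and options, read_pdf may hand back a single DataFrame or a list of them, so this wraps the call and always returns a list. The helper name read_all_tables and its reader parameter are my own (hypothetical) names, not part of tabula; reader only exists so the function can be tried without Java installed.

```python
def read_all_tables(pdf_path, reader=None):
    """Return every table found in pdf_path as a list.

    `reader` is a hypothetical injection point for testing; by default
    tabula's own read_pdf is used.
    """
    if reader is None:
        # Imported lazily so the helper can be loaded without tabula/Java.
        from tabula import read_pdf as reader

    result = reader(pdf_path, pages="all", multiple_tables=True)

    # Normalise the return value: a lone table becomes a one-element list.
    return result if isinstance(result, list) else [result]
```

With tabula installed you would call it as tables = read_all_tables(r"C:\Users\Himanshu Poddar\Desktop\pdf_file.pdf") and then loop over tables.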

Answer 3

If your PDF has multiple tables, you can use the multiple_tables=True option.

Answer 4

Using the multiple_tables=True parameter in read_pdf will solve the issue.

Example:

from tabula import wrapper
df = wrapper.read_pdf("sample.pdf",multiple_tables=True)

In the tabula-py version I used, read_pdf lives in the wrapper module, so it needs to be imported from there and used as shown above.

Answer 5

If the tables have the same structure (i.e., the same layout and the same relative position) on every page of the PDF, then you can set pages='all' to get the correct result.

If not, you may need to iterate over all the pages to parse the PDF.

There is documentation that explains this in detail.
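
A rough sketch of the page-by-page approach, assuming you already know which page numbers to read. The names read_tables_by_page and reader are hypothetical (mine, not tabula's); reader is only there so the loop can be exercised without Java. The pages argument of read_pdf accepts a single page number, which is what the loop relies on.

```python
def read_tables_by_page(pdf_path, page_numbers, reader=None):
    """Read each page separately and return {page_number: [tables]}.

    `reader` is a hypothetical injection point for testing; by default
    tabula's own read_pdf is used.
    """
    if reader is None:
        # Imported lazily so the helper can be loaded without tabula/Java.
        from tabula import read_pdf as reader

    tables_by_page = {}
    for page in page_numbers:
        found = reader(pdf_path, pages=page, multiple_tables=True)
        # Keep a list per page even if a single table comes back.
        tables_by_page[page] = found if isinstance(found, list) else [found]
    return tables_by_page
```

Reading one page at a time makes it possible to pass different options for pages whose tables are laid out differently, at the cost of one tabula call per page.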

Answer 1 3 2019-03-16 21:08:57
Answer 2 2 2018-07-19 08:59:49
Answer 3 0 2018-09-25 11:37:42
Answer 4 0 2019-09-16 12:58:08
Answer 5 0 2019-12-08 12:14:57
Answer 6 0 2020-12-12 18:21:19
