How to extract more than one table present in a PDF file with tabula in Python?
If only one table is present in a PDF file, it can be extracted simply with this code:
from tabula import read_pdf
df = read_pdf(r"C:\Users\Himanshu Poddar\Desktop\pdf_file.pdf")
But if more than one table is present in the PDF file, I am unable to extract them, because it only extracts the first one.
Hope the code below will be helpful, though I haven't tested it with large tables yet. Let me know if there is any scenario in which this code could fail. I'm new to Python, so that would help me improve my knowledge :)
import os
from tabula import wrapper

os.chdir("E:/Documents/myPy/")
tables = wrapper.read_pdf("MyPDF.pdf", multiple_tables=True, pages='all', encoding='utf-8', spreadsheet=True)

i = 1
for table in tables:
    # Use the first row as the column names, then drop that row from the data
    table.columns = table.iloc[0]
    table = table.reindex(table.index.drop(0)).reset_index(drop=True)
    table.columns.name = None
    # To write Excel
    table.to_excel('output' + str(i) + '.xlsx', header=True, index=False)
    # To write CSV
    table.to_csv('output' + str(i) + '.csv', sep='|', header=True, index=False)
    i = i + 1
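To make the header handling in that loop easier to check without a PDF, here is a minimal sketch of the same idea on a hand-built DataFrame. The function name promote_first_row_to_header and the sample data are made up for illustration; it assumes each extracted table carries its header in the first row, which may not hold for every PDF.

```python
import pandas as pd


def promote_first_row_to_header(table: pd.DataFrame) -> pd.DataFrame:
    """Return a new DataFrame whose column names come from the first row of `table`."""
    # Everything after the first row is data; renumber the index from 0.
    promoted = table.iloc[1:].reset_index(drop=True)
    # The first row's values become the column names.
    promoted.columns = table.iloc[0].tolist()
    return promoted


# Made-up data standing in for one table returned by read_pdf (hypothetical).
raw = pd.DataFrame([["name", "qty"], ["apple", "3"], ["pear", "5"]])
clean = promote_first_row_to_header(raw)
print(clean)
```

Keeping this step in its own function makes it possible to try it on a small table before running it over every table extracted from a large file.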
Even when using the tabula-py wrapper, you can use all the same options that are listed in the Tabula Java docs.
In your case, you can simply add pages="all":
from tabula import read_pdf
df = read_pdf(r"C:\Users\Himanshu Poddar\Desktop\pdf_file.pdf", pages ="all")
If your PDF has multiple tables, using the multiple_tables=True parameter in read_pdf will solve the issue.
Example:
from tabula import wrapper
df = wrapper.read_pdf("sample.pdf",multiple_tables=True)
read_pdf now lives in the wrapper module, so we need to import it from there and use it as shown above.
If the tables have the same structure (i.e., the same layout and the same relative position) on every page of the PDF, you can set pages='all' to get the correct result.
If not, you may need to iterate over all the pages to parse the PDF.
There is documentation that explains this in detail.
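As a rough sketch of that page-by-page approach: the helper below calls read_pdf once per page and collects the results by page number. The name read_tables_per_page and its reader parameter are hypothetical, added only so the loop can be tried without tabula-py installed; the page numbers have to be supplied by the caller, since this sketch does not count the pages of the PDF itself.

```python
from typing import Callable, Dict, Iterable, List, Optional


def read_tables_per_page(
    path: str,
    page_numbers: Iterable[int],
    reader: Optional[Callable] = None,
    **options,
) -> Dict[int, List]:
    """Call the reader once per page and return {page number: list of tables}."""
    if reader is None:
        # Imported lazily; requires tabula-py and a Java runtime.
        from tabula import read_pdf as reader
    tables_by_page = {}
    for page in page_numbers:
        # multiple_tables=True makes read_pdf return a list of DataFrames.
        tables_by_page[page] = reader(path, pages=page, multiple_tables=True, **options)
    return tables_by_page


# Hypothetical usage, assuming MyPDF.pdf has at least three pages:
# tables = read_tables_per_page("MyPDF.pdf", range(1, 4))
```

Any extra keyword arguments are forwarded unchanged to the reader, so the same options used with pages='all' can be reused here.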