How to extract more than one table present in a pdf file with tabula in python?

If only one table is present in a PDF file, it can be extracted simply with the following code:

from tabula import read_pdf
df = read_pdf(r"C:\Users\Himanshu Poddar\Desktop\pdf_file.pdf")

But if the PDF file contains more than one table, I am unable to extract them all, because only the first one is extracted.

I hope the code below helps, although I have not tested it with large tables. Let me know if there is any scenario in which this code could fail; I'm new to Python and would like to improve my knowledge :)

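import os
from tabula import read_pdf

os.chdir("E:/Documents/myPy/")
# lattice=True is the option that older tabula-py releases called spreadsheet
tables = read_pdf("MyPDF.pdf", multiple_tables=True, pages='all', encoding='utf-8', lattice=True)

# Number the output files starting from 1
for i, table in enumerate(tables, start=1):
    # Promote the first row to the header (assumes it was read as a data row)
    table.columns = table.iloc[0]
    table = table.iloc[1:].reset_index(drop=True)
    table.columns.name = None
    # To write Excel (needs an Excel writer such as openpyxl installed)
    table.to_excel('output' + str(i) + '.xlsx', header=True, index=False)
    # To write pipe-delimited CSV
    table.to_csv('output' + str(i) + '.csv', sep='|', header=True, index=False)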

Even through the tabula-py wrapper, you can use all of the options listed in the Tabula Java docs.

In your case you can simply add pages = "all":

from tabula import read_pdf
df = read_pdf(r"C:\Users\Himanshu Poddar\Desktop\pdf_file.pdf", pages ="all")
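As a rough sketch of what to do with the result: depending on the installed tabula-py version and the options passed, read_pdf may hand back a single DataFrame or a list of them, so it can help to normalise the result before looping. The helper names as_table_list and save_all_tables, and the "table" file prefix, are made up for this illustration and are not part of tabula-py.

```python
def as_table_list(result):
    """Return whatever read_pdf gave back as a plain list of tables."""
    if result is None:
        return []
    if isinstance(result, (list, tuple)):
        return list(result)
    # A single DataFrame (or any other single object) becomes a one-item list
    return [result]


def save_all_tables(pdf_path, prefix="table"):
    """Extract every table from every page and write each one to its own CSV."""
    # Imported here so as_table_list can be used without tabula-py installed;
    # tabula-py itself also needs a Java runtime.
    from tabula import read_pdf

    tables = as_table_list(read_pdf(pdf_path, pages="all", multiple_tables=True))
    written = []
    for number, table in enumerate(tables, start=1):
        out_path = f"{prefix}_{number}.csv"  # hypothetical naming scheme
        table.to_csv(out_path, index=False)
        written.append(out_path)
    return written
```

Calling save_all_tables(r"C:\Users\Himanshu Poddar\Desktop\pdf_file.pdf") would then write one CSV per extracted table into the current directory.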

If your PDF has multiple tables, passing the multiple_tables=True parameter to read_pdf will solve the issue.

Example:

from tabula import wrapper
df = wrapper.read_pdf("sample.pdf",multiple_tables=True)

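At the time of writing, read_pdf lives in the wrapper module, so we need to import it and use it as shown above. Newer tabula-py releases no longer ship a wrapper module; there, use from tabula import read_pdf instead.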

If the tables have the same structure (i.e., the same layout and the same relative position) on every page of the PDF, you can set pages='all' to get the correct result.

If not, you may need to iterate over the pages and parse each one separately.
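A minimal sketch of that page-by-page approach, assuming you already know which page numbers to read (tabula-py does not report the page count for you). The function name read_tables_per_page and its reader parameter are hypothetical; reader only exists so the loop can be tried out without tabula-py, and it defaults to tabula's read_pdf.

```python
def read_tables_per_page(pdf_path, page_numbers, reader=None, **options):
    """Return {page_number: [tables]} by calling the reader once per page."""
    if reader is None:
        # Default to tabula-py (needs a Java runtime); imported lazily
        from tabula import read_pdf as reader

    tables_by_page = {}
    for page in page_numbers:
        # Extra keyword options (for example lattice=True) are passed through
        result = reader(pdf_path, pages=page, multiple_tables=True, **options)
        if result is None:
            result = []
        elif not isinstance(result, list):
            result = [result]
        tables_by_page[page] = result
    return tables_by_page
```

For example, read_tables_per_page("sample.pdf", range(1, 4)) would read pages 1 to 3 one at a time. Because each page is a separate call, you can also call the function several times with different options for pages whose tables are laid out differently.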

The documentation explains this in detail.
