
How to extract more than one table present in a PDF file with tabula in Python?

If only one table is present in a PDF file, it can be extracted simply with this code:

from tabula import read_pdf
df = read_pdf(r"C:\Users\Himanshu Poddar\Desktop\pdf_file.pdf")

But if more than one table is present in the PDF file, I am unable to extract them, because it only extracts the first one.

I hope the code below is helpful, though I have not tested it with large tables. Let me know if there is any scenario in which this code could fail. I'm new to Python, so the feedback would help me improve my knowledge :)

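import os
from tabula import wrapper

os.chdir("E:/Documents/myPy/")
# multiple_tables=True returns a list of DataFrames, one per detected table.
# spreadsheet=True forces lattice-mode extraction; newer tabula-py releases
# may expect lattice=True instead.
tables = wrapper.read_pdf("MyPDF.pdf", multiple_tables=True, pages='all', encoding='utf-8', spreadsheet=True)

for i, table in enumerate(tables, start=1):
    # Promote the first row to the header, then drop it from the data
    table.columns = table.iloc[0]
    table = table.reindex(table.index.drop(0)).reset_index(drop=True)
    table.columns.name = None
    # Write Excel
    table.to_excel('output' + str(i) + '.xlsx', header=True, index=False)
    # Write CSV
    table.to_csv('output' + str(i) + '.csv', sep='|', header=True, index=False)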
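
The header-promotion step in the loop above can be tried without a PDF. The following is a minimal sketch on a hand-built DataFrame; the helper name promote_first_row and the sample data are illustrative assumptions, not part of the original answer.

```python
import pandas as pd


def promote_first_row(table: pd.DataFrame) -> pd.DataFrame:
    """Use the first row as column names and drop it from the data."""
    # Take every row after the first and renumber the index from 0.
    promoted = table.iloc[1:].reset_index(drop=True)
    # The values of the first row become the column labels.
    promoted.columns = list(table.iloc[0])
    return promoted


# Hypothetical table whose header ended up as the first data row.
raw = pd.DataFrame([["name", "qty"], ["apple", "3"], ["pear", "5"]])
clean = promote_first_row(raw)
print(clean)
```

The helper returns a new DataFrame and leaves the input untouched, which makes it easy to compare the table before and after.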

Even when using the tabula-py wrapper, you can use all the same options that are listed in the Tabula Java docs.

In your case you can simply add pages="all":

from tabula import read_pdf
df = read_pdf(r"C:\Users\Himanshu Poddar\Desktop\pdf_file.pdf", pages ="all")

If your PDF has multiple tables, you can use the multiple_tables=True option.

Using the multiple_tables=True parameter in read_pdf will solve the issue.

Example:

from tabula import wrapper
df = wrapper.read_pdf("sample.pdf",multiple_tables=True)

read_pdf now lives in the wrapper module, so we need to import that module and call it as shown above.
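
The pages="all" and multiple_tables=True options from the answers above can be combined. The sketch below is one possible way to do that, not a definitive implementation: the helper name extract_all_tables and its reader parameter are assumptions added for illustration, and the exact import path of read_pdf may differ between tabula-py versions. Depending on the version and options, read_pdf may return a single DataFrame or a list, so the result is normalised to a list.

```python
def extract_all_tables(pdf_path, reader=None):
    """Return every table found in pdf_path as a list."""
    if reader is None:
        # Imported lazily so the helper can be defined, and tried with a
        # stand-in reader, on a machine without Java or tabula-py.
        from tabula import read_pdf as reader
    result = reader(pdf_path, pages="all", multiple_tables=True)
    # Normalise: a single DataFrame becomes a one-element list.
    return result if isinstance(result, list) else [result]


# Usage (assumes tabula-py and Java are installed and the file exists):
# for number, table in enumerate(extract_all_tables("sample.pdf"), start=1):
#     print(number, table.shape)
```

Passing a custom reader is only a convenience for experimenting; in normal use the default tabula reader is picked up automatically.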

If the tables have the same structure on every page of the PDF (that is, the same layout and the same relative position), then you can set pages='all' to get the correct result.

If not, you may need to iterate over the pages and parse each one separately.

There is documentation that explains this in detail.
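
Iterating over pages could look like the sketch below. It assumes the page numbers are known in advance, since tabula-py itself does not report the page count; the helper name read_tables_per_page and its reader parameter are hypothetical. Extra keyword arguments are forwarded unchanged to read_pdf.

```python
def read_tables_per_page(pdf_path, page_numbers, reader=None, **options):
    """Map each page number to the list of tables found on that page."""
    if reader is None:
        # Lazy import: tabula-py (and Java) are only needed for real PDFs.
        from tabula import read_pdf as reader
    tables_by_page = {}
    for page in page_numbers:
        # Read one page at a time so each page is parsed on its own.
        result = reader(pdf_path, pages=page, multiple_tables=True, **options)
        tables_by_page[page] = result if isinstance(result, list) else [result]
    return tables_by_page


# Usage (assumes a three-page file named sample.pdf):
# tables = read_tables_per_page("sample.pdf", [1, 2, 3])
# for page, page_tables in tables.items():
#     print(page, len(page_tables))
```

Keeping the results keyed by page makes it straightforward to apply different clean-up steps to pages whose tables have different layouts.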


