Extracting tables from a PDF, Python 3, tabula-py

Question

I am trying to extract tables from a PDF using Python 3.6. [pyPDF2][1] seems to fail, and [pdfminer][2] is not compatible with 3.x. I found a Python wrapper for [tabula][3].

import tabula
file_list = get_pdf_list()

text = tabula.read_pdf(file_list[0])
print(text)

tabula.convert_into(file_list[0], "test.json", output_format="json")

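Both read_pdf and convert_into return empty results. PyPDF2 has the same problem. No errors are raised at runtime.

I am starting to think this has to do with the format of my PDF. Does anyone have more experience with this? I am trying to extract a single value from a table in the PDF.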

Answer 1

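I hope you have already found an answer, but here is my code anyway. I would say tabula is the one to use among PDF table extractors. I ran into many issues with camelot.

Install the latest tabula package:

pip install tabula-py

The code: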

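import os
import tabula

# Work from the folder that holds the PDF
# (os.path.abspath only builds a path string, it does not change directory)
os.chdir("E:/Documents/myPy/")
# With multiple_tables=True, read_pdf returns a list of DataFrames, one per table
tables = tabula.read_pdf("MyPDF.pdf", multiple_tables=True, pages='all')

# Write each table to its own Excel file: output1.xlsx, output2.xlsx, ...
for i, table in enumerate(tables, start=1):
    table.to_excel('output' + str(i) + '.xlsx', index=False)
    print(i)

Give it a try!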
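
Since the question only needs one value out of a table, here is a rough sketch of how that lookup might go using the JSON output instead of DataFrames. It assumes the JSON that tabula produces is a list of tables, each with a "data" list of rows whose cells carry a "text" field, so check that against your own output first. The function names, the file name "MyPDF.pdf", and the label "Total" are placeholders of mine, not part of tabula-py.

```python
def find_value_next_to(tables_json, label):
    """Return the text of the cell to the right of the first cell equal to label.

    Assumes each table is a dict with a "data" list of rows, and each cell
    is a dict with a "text" key (the layout tabula's JSON output appears to use).
    """
    for table in tables_json:
        for row in table.get("data", []):
            texts = [cell.get("text", "").strip() for cell in row]
            # Stop one cell early so there is always a right-hand neighbour
            for i, text in enumerate(texts[:-1]):
                if text == label:
                    return texts[i + 1]
    return None


def extract_value(pdf_path, label):
    """Read every page as JSON and look up the value next to label."""
    import tabula  # imported here because it needs tabula-py and a Java runtime

    tables_json = tabula.read_pdf(pdf_path, pages="all", output_format="json")
    return find_value_next_to(tables_json, label)
```

You would call it as extract_value("MyPDF.pdf", "Total"). If it returns None, print the raw JSON first to see whether any table was detected at all.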

Answer 2

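The table may have no borders, so it is laid out like ordinary text rather than a ruled grid. tabula-py has options for both cases:

stream: if True, searches for the rows and columns of a table based on how the text is arranged
lattice: if True, searches for a table whose rows and columns are defined by proper borders

tabula.convert_into(table_file, output_csv, output_format='csv', lattice=False, stream=True, pages=1)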
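
If you are not sure which of the two modes suits your PDF, one option is to try both and keep whichever finds something. The helper below is only a sketch: read_tables_any_mode and its reader parameter are names I made up so the logic can be tried without a PDF, and it assumes read_pdf returns a list of tables when multiple_tables=True.

```python
def read_tables_any_mode(pdf_path, pages="all", reader=None):
    """Try lattice mode first, then stream mode.

    Returns (mode, tables) for the first mode that yields a non-empty table,
    or (None, []) if neither mode finds anything.
    """
    if reader is None:
        import tabula  # imported lazily; needs tabula-py and a Java runtime
        reader = tabula.read_pdf

    for mode in ("lattice", "stream"):
        tables = reader(
            pdf_path,
            pages=pages,
            multiple_tables=True,
            lattice=(mode == "lattice"),
            stream=(mode == "stream"),
        )
        # Drop tables with no rows so an empty detection does not count
        non_empty = [table for table in tables if len(table) > 0]
        if non_empty:
            return mode, non_empty
    return None, []
```

Calling read_tables_any_mode("MyPDF.pdf") tells you which mode worked. If both come back empty, the PDF may not contain extractable text at all, for example a scanned image.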
