tabula-py does not run with some PDF files
I am trying to extract tables from some PDFs with tabula (Python).
For some PDF files I get the error below.
tables = read_pdf(file_path, pages = 'all')
Error from tabula-java:
Error: File does not exist
Traceback (most recent call last):
Input In [71] in <cell line: 1>
tables = read_pdf(file_path, pages = 'all')
File ~\anaconda3\lib\site-packages\tabula\io.py:322 in read_pdf
output = _run(java_options, kwargs, path, encoding)
File ~\anaconda3\lib\site-packages\tabula\io.py:80 in _run
result = subprocess.run(
File ~\anaconda3\lib\subprocess.py:516 in run
raise CalledProcessError(retcode, process.args,
CalledProcessError: Command '['java', '-Dfile.encoding=UTF8', '-jar', 'C:\\Users\\xxx\\anaconda3\\lib\\site-packages\\tabula\\tabula-1.0.5-jar-with-dependencies.jar', '--pages', 'all', '--guess', '--format', 'JSON', 'C:/Users/xx/yyy/Invoice/75211-INV-1180235.PDF']' returned non-zero exit status 1.
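It looks like a Java error, but I can still extract DataFrames perfectly from other PDF files.
I also tried extracting the tables with tabula.exe (which runs in the browser at http://127.0.0.1:8080). It works for all the PDF files, including the ones that fail when run through code.
-------------- Update: printed log ------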
print(file_path)  # 1. print the file path before using tabula on it
# 2a. the try-except block can catch the error output
try:
    tables = read_pdf(file_path, pages='all')
except Exception as e:
    print(e)  # 2b. print the error output or exception
C:/Users/quock/tapetco/Kinh Doanh - Documents/Chứng Từ/Foreign Airports/AEG/Invoice/error/75211-INV-1180235.PDF
Error from tabula-java:
Error: File does not exist
Command '['java', '-Dfile.encoding=UTF8', '-jar', 'C:\\Users\\xxx\\anaconda3\\lib\\site-packages\\tabula\\tabula-1.0.5-jar-with-dependencies.jar', '--pages', 'all', '--guess', '--format', 'JSON', 'C:/Users/xx/yyy/Invoice/75211-INV-1180235.PDF']' returned non-zero exit status 1.
I have also added the PDF files:
75211-INV-1180235.pdf: the file that produces the error
APAG_20170615.Z437175BA4191210EE0094E1: works fine
Try debugging your script:
print(file_path)  # 1. print the file path before using tabula on it
# 2a. the try-except block can catch the error output
try:
    tables = read_pdf(file_path, pages='all')
except Exception as e:
    print(e)  # 2b. print the error output or exception
The error you mention indicates that the file passed as an argument to tabula does not exist:
Error from tabula-java:
Error: File does not exist
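Before suspecting tabula, it can help to confirm from Python that the path really points at a file. The sketch below is a minimal check using only pathlib; `inspect_pdf_path` is a hypothetical helper name, not part of tabula-py. It also reports non-ASCII characters and spaces in the path, on the assumption (not confirmed in this thread) that such characters could matter when the path is handed to the Java process.

```python
from pathlib import Path


def inspect_pdf_path(file_path):
    """Collect basic facts about a path before handing it to tabula.

    Hypothetical helper for illustration; not part of tabula-py.
    """
    path = Path(file_path)
    text = str(path)
    return {
        "exists": path.exists(),    # does anything exist at this path?
        "is_file": path.is_file(),  # is it a regular file (not a directory)?
        # Distinct characters outside the ASCII range, e.g. accented letters.
        "non_ascii": sorted({ch for ch in text if ord(ch) > 127}),
        "has_spaces": " " in text,
    }
```

If `exists` is true here while tabula-java still reports `File does not exist`, the file is visible to Python, which points toward how the path reaches Java rather than toward a missing file.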
I installed tabula-py with pip3 install tabula-py and prepared this script to run tabula against each given file:
Script SO_tabula.py:
import sys
import tabula

if len(sys.argv) < 2:
    print('Missing required argument. Usage: py <PDF>.')
    sys.exit(1)

pdf = sys.argv[1]
print(f"Extracting tables from '{pdf}' using tabula-py with option 'pages=all'..")
try:
    # Read the PDF into a list of DataFrames
    dfs = tabula.read_pdf(pdf, pages='all')
    print(f"Result:\n{dfs}")
except Exception as e:
    print(f"Error from tabula-py: {e}")
    sys.exit(1)
For the 2 given files it works fine:
❯ python3 SO_tabula.py 75211-INV-1180235.PDF
Extracting tables from '75211-INV-1180235.PDF' using tabula-py with option 'pages=all'..
Result:
[ Unnamed: 0 Invoice # 1180235
0 NaN Invoice Date 11/29/2021
1 NaN Terms NET15
2 NaN Due Date 12/14/2021
3 NaN Currency USD
4 SERVICE LOCATION : Customer # 75211
5 Airport: VOMM - CHENNAI, IN Page 1, Empty DataFrame
Columns: [No, Trans.Date, Item Desc, Ref. #, Equip. ID, Flight #, Qty, UOM, Unit Price, Extended Price]
Index: []]
Output truncated for the second PDF:
❯ python3 SO_tabula.py APAG_20170615.pdf
Extracting tables from 'APAG_20170615.pdf' using tabula-py with option 'pages=all'..
Result:
[ Unnamed: 0 ...
For a made-up (non-existent) file, it shows a similar file-not-found error:
❯ python3 SO_tabula.py APAG_20170615.pdf_
Extracting tables from 'APAG_20170615.pdf_' using tabula-py with option 'pages=all'..
Error from tabula-py: [Errno 2] No such file or directory: 'APAG_20170615.pdf_'
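Assuming all the given files exist and are accessible from your script, there seems to be an issue in tabula itself or in its Python wrapper.
To analyze this further, I would usually look at tabula's logs or search for a (command-line) option, in tabula.jar or tabula-py, that shows verbose debug output. But I did not find any such option.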
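Since no verbose option turned up, one experiment is to rule the path in or out as the cause. The path printed in the question contains spaces and non-ASCII characters ("Chứng Từ"), so a plausible but unverified hypothesis is that tabula-java cannot resolve it. The sketch below copies the PDF to a temporary file named `input.pdf` and runs a reader on the copy. `read_via_ascii_copy` and its `reader` parameter are hypothetical names, and the sketch assumes the system temp directory itself has an ASCII-only path.

```python
import shutil
import tempfile
from pathlib import Path


def read_via_ascii_copy(pdf_path, reader):
    """Copy pdf_path to a temporary, simply named file and call reader on it.

    Hypothetical helper for illustration. reader is any callable that takes
    a path string, e.g. lambda p: tabula.read_pdf(p, pages="all").
    """
    source = Path(pdf_path)
    with tempfile.TemporaryDirectory(prefix="tabula_") as tmp_dir:
        # Fixed ASCII file name without spaces; the temp directory is
        # assumed (not checked) to have an ASCII-only path as well.
        ascii_copy = Path(tmp_dir) / "input.pdf"
        shutil.copyfile(source, ascii_copy)
        # The copy is deleted when the with-block exits, so the reader
        # must finish reading before returning.
        return reader(str(ascii_copy))
```

For example, `read_via_ascii_copy(file_path, lambda p: tabula.read_pdf(p, pages='all'))`. If the copy parses while the original path fails, the path is the likely culprit; if both fail, the problem lies elsewhere.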