tabula-py does not work with some PDF files

Question

I am trying to extract tables from some PDFs using tabula (Python).

With some of the PDF files I get the following error.

tables = read_pdf(file_path, pages = 'all')
Error from tabula-java:
Error: File does not exist


Traceback (most recent call last):

  Input In [71] in <cell line: 1>
    tables = read_pdf(file_path, pages = 'all')

  File ~\anaconda3\lib\site-packages\tabula\io.py:322 in read_pdf
    output = _run(java_options, kwargs, path, encoding)

  File ~\anaconda3\lib\site-packages\tabula\io.py:80 in _run
    result = subprocess.run(

  File ~\anaconda3\lib\subprocess.py:516 in run
    raise CalledProcessError(retcode, process.args,

CalledProcessError: Command '['java', '-Dfile.encoding=UTF8', '-jar', 'C:\\Users\\xxx\\anaconda3\\lib\\site-packages\\tabula\\tabula-1.0.5-jar-with-dependencies.jar', '--pages', 'all', '--guess', '--format', 'JSON', 'C:/Users/xx/yyy/Invoice/75211-INV-1180235.PDF']' returned non-zero exit status 1.

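It looks like a Java error. However, I can still extract DataFrames from other PDF files without any problem.

I also tried extracting the tables with tabula.exe (which runs in the browser at http://127.0.0.1:8080). It works for all PDF files, including the ones that fail when I run them through code.

-------------- Update: printed log ------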

print(file_path)  # 1. print the file-path before using tabula on it
# 2a. the try-except block can catch error output
try:
    tables = read_pdf(file_path, pages = 'all')
except Exception as e:
    print(e)  # 2b. print the error-output or exception
C:/Users/quock/tapetco/Kinh Doanh - Documents/Chứng Từ/Foreign Airports/AEG/Invoice/error/75211-INV-1180235.PDF
Error from tabula-java:
Error: File does not exist


Command '['java', '-Dfile.encoding=UTF8', '-jar', 'C:\\Users\\xxx\\anaconda3\\lib\\site-packages\\tabula\\tabula-1.0.5-jar-with-dependencies.jar', '--pages', 'all', '--guess', '--format', 'JSON', 'C:/Users/xx/yyy/Invoice/75211-INV-1180235.PDF']' returned non-zero exit status 1.

I have also added the PDF files. File that produces the error: 75211-INV-1180235.pdf. File that works fine: APAG_20170615.Z437175BA4191210EE0094E1

The PDF file that produces the error

Answer 1

Try debugging with print

Try to debug your script:

Print or log the file you pass as an argument to tabula
Print or log the output from tabula

print(file_path)  # 1. print the file-path before using tabula on it
# 2a. the try-except block can catch error output
try:
    tables = read_pdf(file_path, pages = 'all')
except Exception as e:
    print(e)  # 2b. print the error-output or exception

The error you mention indicates that the file passed as an argument to tabula does not exist:

Error from tabula-java: Error: File does not exist
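
Before calling tabula, it may help to check what Python itself sees at that path. The following is only a minimal sketch; the helper name describe_path is made up for illustration and is not part of tabula-py. It reports whether the path exists, whether it is a regular file, and whether it contains non-ASCII characters (the path printed in the question does).

```python
from pathlib import Path


def describe_path(file_path):
    """Return a small report on what Python sees at file_path.

    Hypothetical helper for debugging; not part of tabula-py.
    """
    p = Path(file_path)
    return {
        "path": str(p),
        "exists": p.exists(),                 # True if anything exists at this path
        "is_file": p.is_file(),               # True only for a regular file
        "non_ascii": not str(p).isascii(),    # True if the path has non-ASCII characters
    }


# Example call with a made-up path that does not exist
print(describe_path("no/such/file.pdf"))
```

If Python reports that the file exists while tabula-java still says it does not, the path is probably being altered somewhere between the two, which narrows down where to look.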

See also:

How to print an exception in Python?

Reproducing with the 2 given files

I installed tabula-py with pip3 install tabula-py and prepared this script to run tabula against each given file:

Script SO_tabula.py:

import sys
import tabula

if len(sys.argv) < 2:
    print('Missing required argument. Usage: py <PDF>.')
    exit(1)

pdf = sys.argv[1]
print(f"Extracting tables from '{pdf}' using tabula-py with option 'pages=all'..")

try:
    # Read pdf into list of DataFrame
    dfs = tabula.read_pdf(pdf, pages='all')
    print(f"Result:\n{dfs}")
except Exception as e:
    print(f"Error from tabula-py: {e}")
    exit(1)

For the 2 given files it works fine:

❯ python3 SO_tabula.py 75211-INV-1180235.PDF
Extracting tables from '75211-INV-1180235.PDF' using tabula-py with option 'pages=all'..
Result:
[                     Unnamed: 0     Invoice #     1180235
0                           NaN  Invoice Date  11/29/2021
1                           NaN         Terms       NET15
2                           NaN      Due Date  12/14/2021
3                           NaN      Currency         USD
4            SERVICE LOCATION :    Customer #       75211
5  Airport: VOMM  - CHENNAI, IN          Page           1, Empty DataFrame
Columns: [No, Trans.Date, Item Desc, Ref. #, Equip. ID, Flight #, Qty, UOM, Unit Price, Extended Price]
Index: []]

Output for the second PDF (truncated):

❯ python3 SO_tabula.py APAG_20170615.pdf
Extracting tables from 'APAG_20170615.pdf' using tabula-py with option 'pages=all'..
Result:
[   Unnamed: 0   ...

For a made-up (non-existent) file, it reports a file-not-found error:

❯ python3 SO_tabula.py APAG_20170615.pdf_
Extracting tables from 'APAG_20170615.pdf_' using tabula-py with option 'pages=all'..
Error from tabula-py: [Errno 2] No such file or directory: 'APAG_20170615.pdf_'
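
The separate runs above could also be collected in one pass. This is a rough sketch, not the script actually used for the output shown: run_all is a made-up name, and the reader is passed in as a parameter so that the loop can be tried without tabula installed.

```python
def run_all(paths, reader):
    """Call reader(path) for every path and record success or the error.

    `reader` is any callable that returns a list of tables,
    for example a thin wrapper around tabula.read_pdf.
    """
    results = {}
    for path in paths:
        try:
            tables = reader(path)
            results[path] = f"ok: {len(tables)} table(s)"
        except Exception as exc:
            # Keep the exception type: it shows which layer reported the problem
            results[path] = f"error: {type(exc).__name__}: {exc}"
    return results
```

With tabula-py installed, the reader could be lambda p: tabula.read_pdf(p, pages='all'), the same call used in the script above.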

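Further analysis

Assuming all the given files exist and are accessible from your script, there seems to be a problem in tabula itself or in its Python wrapper.

To analyze this further, I would normally look at tabula's logs or search for a (command-line) option, in tabula.jar or tabula-py, that shows verbose debug output. But I did not find any such option.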
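
One experiment that might narrow this down, offered as an untested hypothesis rather than a confirmed cause: the path printed in the question contains non-ASCII characters, so it could be worth copying the PDF to a temporary location under a plain ASCII file name and running tabula on the copy. The helper name read_via_ascii_copy is made up for this sketch, and the reader is a parameter so the copy logic can be tried on its own.

```python
import shutil
import tempfile
from pathlib import Path


def read_via_ascii_copy(file_path, reader):
    """Copy file_path to a temp directory as 'input<suffix>' and call reader on the copy.

    Hypothetical helper. Only the file name is guaranteed to be ASCII here;
    the temp directory itself depends on the system configuration.
    """
    src = Path(file_path)
    with tempfile.TemporaryDirectory() as tmp_dir:
        copy_path = Path(tmp_dir) / ("input" + src.suffix.lower())
        shutil.copyfile(src, copy_path)
        # The reader must finish before the temp directory is cleaned up
        return reader(str(copy_path))
```

With tabula-py, the reader could be lambda p: tabula.read_pdf(p, pages='all'). If the copy is read successfully while the original path fails, that would point at path handling rather than at the PDF content.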
