tabula-py fails on some PDF files

I am trying to extract tables from some PDFs using tabula (Python).

With some PDF files I get the following error.

tables = read_pdf(file_path, pages = 'all')
Error from tabula-java:
Error: File does not exist


Traceback (most recent call last):

  Input In [71] in <cell line: 1>
    tables = read_pdf(file_path, pages = 'all')

  File ~\anaconda3\lib\site-packages\tabula\io.py:322 in read_pdf
    output = _run(java_options, kwargs, path, encoding)

  File ~\anaconda3\lib\site-packages\tabula\io.py:80 in _run
    result = subprocess.run(

  File ~\anaconda3\lib\subprocess.py:516 in run
    raise CalledProcessError(retcode, process.args,

CalledProcessError: Command '['java', '-Dfile.encoding=UTF8', '-jar', 'C:\\Users\\xxx\\anaconda3\\lib\\site-packages\\tabula\\tabula-1.0.5-jar-with-dependencies.jar', '--pages', 'all', '--guess', '--format', 'JSON', 'C:/Users/xx/yyy/Invoice/75211-INV-1180235.PDF']' returned non-zero exit status 1.

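It looks like a Java error, but I can still extract DataFrames from other PDF files without any problem.

I also tried extracting the tables with tabula.exe (which runs in the browser at http://127.0.0.1:8080). It works for all PDF files, including the ones that fail when run through code.

-------------- Update: printed log ------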

print(file_path)  # 1. print the file-path before using tabula on it
# 2a. the try-except block can catch error output
try:
    tables = read_pdf(file_path, pages = 'all')
except Exception as e:
    print(e)  # 2b. print the error-output or exception
C:/Users/quock/tapetco/Kinh Doanh - Documents/Chứng Từ/Foreign Airports/AEG/Invoice/error/75211-INV-1180235.PDF
Error from tabula-java:
Error: File does not exist


Command '['java', '-Dfile.encoding=UTF8', '-jar', 'C:\\Users\\xxx\\anaconda3\\lib\\site-packages\\tabula\\tabula-1.0.5-jar-with-dependencies.jar', '--pages', 'all', '--guess', '--format', 'JSON', 'C:/Users/xx/yyy/Invoice/75211-INV-1180235.PDF']' returned non-zero exit status 1.

I have also uploaded the PDF files. File that produces the error: 75211-INV-1180235.pdf. File that works fine: APAG_20170615.Z437175BA4191210EE0094E1

The PDF file that produces the error

Try debugging with print

Try to debug your script:

  1. Print or log the file you pass as an argument to tabula
  2. Print or log the output from tabula
print(file_path)  # 1. print the file-path before using tabula on it
# 2a. the try-except block can catch error output
try:
    tables = read_pdf(file_path, pages = 'all')
except Exception as e:
    print(e)  # 2b. print the error-output or exception

The error you mention indicates that the file passed to tabula as an argument does not exist:

Error from tabula-java:
Error: File does not exist
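One way to check this from Python before calling tabula is sketched below. The helper name inspect_path is hypothetical and uses only the standard library. It reports whether the path exists as Python sees it, and whether it contains non-ASCII characters. That second check rests on an assumption: the path printed in the question contains such characters, but nothing in the question confirms that they cause the failure.

```python
import os


def inspect_path(path):
    """Return basic facts about a path before handing it to tabula.

    Hypothetical helper, not part of tabula-py.
    """
    return {
        "exists": os.path.exists(path),
        "is_file": os.path.isfile(path),
        # Characters outside ASCII; whether they matter here is an assumption.
        "non_ascii": sorted({ch for ch in path if ord(ch) > 127}),
    }
```

If "exists" is True in Python while tabula-java still reports "File does not exist", the path is probably being altered somewhere between Python and the Java process.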

Reproducing with the 2 given files

I installed tabula-py with pip3 install tabula-py and prepared this script to run tabula against each given file:

Script SO_tabula.py

import sys
import tabula

if len(sys.argv) < 2:
    print('Missing required argument. Usage: python3 SO_tabula.py <PDF>')
    sys.exit(1)

pdf = sys.argv[1]
print(f"Extracting tables from '{pdf}' using tabula-py with option 'pages=all'..")

try:
    # Read pdf into list of DataFrame
    dfs = tabula.read_pdf(pdf, pages='all')
    print(f"Result:\n{dfs}")
except Exception as e:
    print(f"Error from tabula-py: {e}")
    sys.exit(1)

For the 2 given files it works fine:

❯ python3 SO_tabula.py 75211-INV-1180235.PDF
Extracting tables from '75211-INV-1180235.PDF' using tabula-py with option 'pages=all'..
Result:
[                     Unnamed: 0     Invoice #     1180235
0                           NaN  Invoice Date  11/29/2021
1                           NaN         Terms       NET15
2                           NaN      Due Date  12/14/2021
3                           NaN      Currency         USD
4            SERVICE LOCATION :    Customer #       75211
5  Airport: VOMM  - CHENNAI, IN          Page           1, Empty DataFrame
Columns: [No, Trans.Date, Item Desc, Ref. #, Equip. ID, Flight #, Qty, UOM, Unit Price, Extended Price]
Index: []]

Output for the second PDF (truncated):

❯ python3 SO_tabula.py APAG_20170615.pdf
Extracting tables from 'APAG_20170615.pdf' using tabula-py with option 'pages=all'..
Result:
[   Unnamed: 0   ...

For a made-up (non-existent) file, it shows a similar file-not-found error:

❯ python3 SO_tabula.py APAG_20170615.pdf_
Extracting tables from 'APAG_20170615.pdf_' using tabula-py with option 'pages=all'..
Error from tabula-py: [Errno 2] No such file or directory: 'APAG_20170615.pdf_'
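The message here ("[Errno 2] No such file or directory") is worded differently from the one in the question ("Error from tabula-java: Error: File does not exist" followed by a CalledProcessError). A minimal sketch for telling the two cases apart in an except block follows. The function name classify_failure is hypothetical, and the mapping of exception type to layer is an assumption based only on the two outputs shown in this thread.

```python
import subprocess


def classify_failure(exc):
    """Say which layer most likely rejected the file (hypothetical helper)."""
    if isinstance(exc, FileNotFoundError):
        # Raised on the Python side; matches the "[Errno 2]" message above.
        return "python: file not found"
    if isinstance(exc, subprocess.CalledProcessError):
        # The java process exited with a non-zero status, as in the question.
        return f"java: exit status {exc.returncode}"
    return f"other: {type(exc).__name__}"
```

Calling print(classify_failure(e)) inside the except block of the script would show whether Python or the Java process rejected the path.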

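Further analysis

Assuming all the given files exist and are accessible from your script, the problem seems to lie in tabula itself or in its Python wrapper.

To analyze this further, I would normally look at tabula's logs or search for a (command-line) option (in tabula.jar or tabula-py) that enables verbose debug output. But I did not find any such option.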
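One hypothesis that is cheap to test, although nothing in this thread confirms it, is that the Java process cannot resolve the original path (for example because of the non-ASCII characters in the folder names). The sketch below copies the PDF to a temporary file with a plain ASCII name and calls the reader on the copy. The name read_via_ascii_copy and the reader parameter are hypothetical, and the temporary directory itself is assumed to have an ASCII-only path.

```python
import os
import shutil
import tempfile


def read_via_ascii_copy(pdf_path, reader):
    """Copy pdf_path to a temporary ASCII-named file, then call reader on it.

    `reader` is any callable that takes a path, for example
    lambda p: tabula.read_pdf(p, pages='all').
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Fixed ASCII file name; the temp directory is assumed to be ASCII too.
        safe_path = os.path.join(tmp_dir, "input.pdf")
        shutil.copyfile(pdf_path, safe_path)
        return reader(safe_path)
```

If the copy can be read while the original path cannot, the path rather than the PDF content is the likely culprit.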
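In the absence of a verbose option, one fallback is to run the same java command that appears in the traceback and capture its output directly. The sketch below rebuilds that argument list from the traceback in the question. The function names are hypothetical, and jar_path must point to the tabula jar of your own installation.

```python
import subprocess


def build_tabula_command(jar_path, pdf_path):
    """Rebuild the java command shown in the traceback of the question."""
    return [
        "java", "-Dfile.encoding=UTF8", "-jar", jar_path,
        "--pages", "all", "--guess", "--format", "JSON", pdf_path,
    ]


def run_tabula_java(jar_path, pdf_path):
    """Run tabula-java directly and return (exit status, stdout, stderr)."""
    result = subprocess.run(
        build_tabula_command(jar_path, pdf_path),
        capture_output=True,
        text=True,
        errors="replace",  # do not fail on undecodable bytes in the output
    )
    return result.returncode, result.stdout, result.stderr
```

Printing the returned stderr for the failing file and for a working one makes it easier to see whether the path reaches Java unchanged.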
