Unable to convert multiple pages of a PDF file to CSV using Tabula

Question

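I have a PDF file whose first page has a different data layout, while the rest of the pages share the same table format. I want to convert this multi-page PDF file to a CSV file using Python Tabula.

The current code converts the PDF to CSV when the PDF has only 2 pages; if it has more than two pages, it gives an out-of-range error.

I want to count the total number of pages in the PDF file and, depending on that count, have the Python script convert the PDF to CSV using a separate data frame for each page.

I am running this Python script on a Linux box.

The code is as follows: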

#!/usr/bin/env python3

import tabula
import pandas as pd
import csv

pdf_file='/root/scripts/pdf2xls/Test/21KJAZP011.pdf'
column_names=['Product','Batch No','Machin No','Time','Date','Drum/Bag No','Tare Wt.kg','Gross Wt.kg',
              'Net Wt.kg','Blender','Remarks','Operator']
df_results=[] # store results in a list

# Page 1 processing
try:
    df1 = tabula.read_pdf('/root/scripts/pdf2xls/Test/21KJAZP011.pdf', pages=1,area=(95,20, 800, 840),columns=[93,180,220,252,310,315,333,367,
                                                                          410,450,480,520]
                         ,pandas_options={'header': None}) #(top,left,bottom,right)
    df1[0]=df1[0].drop(columns=5)
    df1[0].columns=column_names
    df_results.append(df1[0])
    df1[0].head(2)

except Exception as e:
    print(f"Exception page not found {e}")


# Page 2 processing
try:
    df2 = tabula.read_pdf('/root/scripts/pdf2xls/Test/21KJAZP011.pdf', pages=2,area=(10,20, 800, 840),columns=[93,180,220,252,310,315,330,370,
                                                                          410,450,480,520]
                         ,pandas_options={'header': None}) #(top,left,bottom,right)

    row_with_Sta = df2[0][df2[0][0] == 'Sta'].index.tolist()[0]
    df2[0] = df2[0].iloc[:row_with_Sta]
    df2[0]=df2[0].drop(columns=5)
    df2[0].columns=column_names
    df_results.append(df2[0])
    df2[0].head(2)

except Exception as e:
    print(f"Exception page not found {e}")

# result = pd.concat([df1[0], df2[0], df3[0]])  # concatenate the pages, then write to CSV
result = pd.concat(df_results)  # concatenate the list of pages, then write to CSV
result.to_csv("result.csv")

with open('/root/scripts/pdf2xls/Test/result.csv', 'r') as f_input, open('/root/scripts/pdf2xls/Test/FinalOutput_21KJAZP011.csv', 'w') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)
    csv_output.writerow(next(csv_input))    # write header

    for cols in csv_input:
        for i in range(7, 9):
            cols[i] = '{:.2f}'.format(float(cols[i]))
        csv_output.writerow(cols)

Please suggest how to achieve this. I am very new to Python and have not been able to work it out myself.

Answer 1

Try pdfplumber https://github.com/jsvine/pdfplumber , it worked like a charm for me

import pdfplumber

pdffile = 'your file'
with pdfplumber.open(pdffile) as pdf:
    # Loop over every page, however many the PDF has
    for page in pdf.pages:
        rawdata = page.extract_table()  # list of rows, or None if no table is found
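
To get from the extracted tables to a CSV file, the rows from every page can be collected and written out with the standard csv module. The following is only a minimal sketch and has not been run against the asker's file: `collect_rows` and `pdf_to_csv` are hypothetical helper names, and it assumes one table of interest per page, since `extract_table()` returns only the largest table it finds on a page, or None.

```python
import csv


def collect_rows(pages):
    """Gather table rows from every page, skipping pages without a table."""
    rows = []
    for page in pages:
        table = page.extract_table()  # list of rows, or None
        if table:
            rows.extend(table)
    return rows


def pdf_to_csv(pdf_path, csv_path):
    """Write the table rows of all pages in pdf_path to csv_path."""
    import pdfplumber  # imported here so collect_rows can be used without it

    with pdfplumber.open(pdf_path) as pdf:
        rows = collect_rows(pdf.pages)  # works for any number of pages

    with open(csv_path, "w", newline="") as f_output:
        csv.writer(f_output).writerows(rows)
    return len(rows)
```

Because the loop runs over `pdf.pages`, the page count never has to be hard-coded. If the first page needs different handling, its rows could be cleaned up separately before being added to the list.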

Answer 2

Extracting multiple tables from a PDF

multiple_tables=True

from tabula import convert_into
table_file = r"PDF_path"
output_csv = r"out_csv"
# convert_into writes the CSV to output_csv and returns None, so there is nothing to assign
convert_into(table_file, output_csv, output_format='csv', lattice=False, stream=True, multiple_tables=True, pages="all")
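
The out-of-range error in the question may come from the 'Sta' lookup: on a page with no row starting with Sta, `index.tolist()[0]` indexes an empty list. That is an inference from the posted code, not something confirmed in the thread. A sketch under that assumption reads each page in a loop, picks the `area` and `columns` by page number, and cuts at the Sta row only when it exists. `page_options`, `trim_at_marker`, and `read_all_pages` are hypothetical helpers; the coordinates are copied from the question's page 1 and page 2 calls, and the page count is taken from pdfplumber as in Answer 1.

```python
# Coordinates copied from the question: (top, left, bottom, right)
FIRST_PAGE_AREA = (95, 20, 800, 840)
OTHER_PAGE_AREA = (10, 20, 800, 840)
FIRST_PAGE_COLUMNS = [93, 180, 220, 252, 310, 315, 333, 367, 410, 450, 480, 520]
OTHER_PAGE_COLUMNS = [93, 180, 220, 252, 310, 315, 330, 370, 410, 450, 480, 520]


def page_options(page_number):
    """Return tabula.read_pdf keyword arguments for a 1-based page number."""
    first = page_number == 1
    return {
        "pages": page_number,
        "area": FIRST_PAGE_AREA if first else OTHER_PAGE_AREA,
        "columns": FIRST_PAGE_COLUMNS if first else OTHER_PAGE_COLUMNS,
        "pandas_options": {"header": None},
    }


def trim_at_marker(df, marker="Sta"):
    """Drop the marker row and everything below it; unchanged if no marker."""
    positions = (df[0] == marker).to_numpy().nonzero()[0]
    if len(positions) == 0:
        return df
    return df.iloc[: positions[0]]


def read_all_pages(pdf_file, column_names):
    """Read every page with its own settings and return one DataFrame."""
    # Third-party imports stay inside the function so that the helpers
    # above can be used and checked without them.
    import pandas as pd
    import pdfplumber
    import tabula

    with pdfplumber.open(pdf_file) as pdf:
        n_pages = len(pdf.pages)  # total page count

    frames = []
    for page_number in range(1, n_pages + 1):
        tables = tabula.read_pdf(pdf_file, **page_options(page_number))
        if not tables:
            continue  # nothing detected on this page
        df = trim_at_marker(tables[0]).drop(columns=5)
        df.columns = column_names
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```

With the question's `pdf_file` and `column_names`, `read_all_pages(pdf_file, column_names).to_csv("result.csv")` would replace the separate page 1 and page 2 blocks. The sketch assumes every page after the first shares the page 2 layout, as the question describes.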

Answer 1
0 2021-11-18 15:32:37

Answer 2
0 2021-11-27 12:25:32
