How to convert a PDF file to a CSV file using Python Pandas

Question

I have a PDF file that I need to convert to a CSV file. A sample of my PDF file is available at this link: https://online.flippingbook.com/view/352975479/ The code I am using is:

import re
import parse
import pdfplumber
import pandas as pd
from collections import namedtuple
file = "Battery Voltage.pdf"
lines = []
total_check = 0

with pdfplumber.open(file) as pdf:
    pages = pdf.pages
    for page in pdf.pages:
        text = page.extract_text()
        for line in text.split('\n'):
            print(line)

With the script above I do not get the correct output: for the time column, "AM" ends up on the next line. The output I get looks like this:

Answer 1

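It helps to understand how the surface of a PDF is painted onto the screen: a single string of plain text can be drawn on the display piece by piece. (Here I have highlighted where the first AM is placed.)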

As a side note, I think the first AM in the file is, at first glance, encoded as this block:

BT
/F1 12 Tf
1 0 0 1 224.20265 754.6322 Tm
[<001D001E>] TJ
ET

In that block, 1D = A and 1E = M.
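To make the mapping concrete, here is a minimal Python sketch that decodes the hex string from the block above. The `GLYPHS` table and the `decode_hex_string` helper are hypothetical: the table contains only the two codes named here, and the sketch assumes 2-byte glyph codes. A real PDF font carries its own mapping.

```python
# Hypothetical glyph table built only from the two codes named above.
# Assumption: each glyph code is 2 bytes (4 hex digits).
GLYPHS = {0x001D: "A", 0x001E: "M"}


def decode_hex_string(hex_string, glyphs=GLYPHS):
    """Decode a PDF hex string such as '<001D001E>' into text."""
    digits = hex_string.strip().strip("<>")
    codes = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
    # Codes missing from the table are shown as '?' rather than guessed.
    return "".join(glyphs.get(code, "?") for code in codes)


print(decode_hex_string("<001D001E>"))  # prints AM
```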

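So if you want to extract each line as it is displayed, by far the simplest approach is to use a tool such as pdftotext, which is designed to output each line of text as it appears on the page.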

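So with a tabular, comma-separated approach you can expect each AM to get a row of its own. Logically that row should be " ",AM," "," " but some extractors will report it as nan,AM,nan,nan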

As text it looks like this, produced from a command line:

pdftotext -layout "Battery Voltage.pdf"

This writes "Battery Voltage.txt" to the same working folder.
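If you would rather stay in Python, the same command can be called through `subprocess`. This is only a sketch: it assumes the pdftotext executable is on your PATH, and the helper names `pdftotext_command` and `extract_layout_text` are made up for illustration.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path):
    """Build the same command line as shown above for a given PDF."""
    return ["pdftotext", "-layout", str(pdf_path)]


def extract_layout_text(pdf_path):
    """Run pdftotext, then read the .txt file it writes next to the PDF."""
    # Assumption: pdftotext is installed and reachable on PATH.
    subprocess.run(pdftotext_command(pdf_path), check=True)
    text_path = Path(pdf_path).with_suffix(".txt")
    return text_path.read_text(encoding="utf-8", errors="replace")
```

Calling `extract_layout_text("Battery Voltage.pdf")` would then return the contents of "Battery Voltage.txt" as one string.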

Then load that into a spreadsheet.

Now, with a few clicks (and no more), we can export the CSV, along with all the oddities it contains, as the "correct output".

,,Battery Vo,ltage,




Sr No,DateT,Ime,Voltage (v),Ignition
1,01/11/2022,00:08:10,47.15,Off
,AM,,,
2,01/11/2022,00:23:10,47.15,Off
,AM,,,
3,01/11/2022,00:38:10,47.15,Off
,AM,,,
4,01/11/2022,00:58:10,47.15,Off
,AM,,,
5,01/11/2022,01:18:10,47.15,Off
,AM,,,
6,01/11/2022,01:33:10,47.15,Off
,AM,,,
7,01/11/2022,01:48:10,47.15,Off
,AM,,,
8,01/11/2022,02:03:10,47.15,Off
,AM,,,
9,01/11/2022,02:18:10,47.15,Off
,AM,,,
10,01/11/2022,02:37:12,47.15,Off
,AM,,,

So if the editing was not done before the CSV was generated, post-processing in an editor is simpler, as with this HTML page (no further applications needed):

,,Battery,Voltage,
Sr No,Date,Time,Voltage (v),Ignition
1,01/11/2022,00:08:10,47.15,Off,AM,,,
2,01/11/2022,00:23:10,47.15,Off,AM,,,
3,01/11/2022,00:38:10,47.15,Off,AM,,,
4,01/11/2022,00:58:10,47.15,Off,AM,,,
5,01/11/2022,01:18:10,47.15,Off,AM,,,
6,01/11/2022,01:33:10,47.15,Off,AM,,,
7,01/11/2022,01:48:10,47.15,Off,AM,,,
8,01/11/2022,02:03:10,47.15,Off,AM,,,
9,01/11/2022,02:18:10,47.15,Off,AM,,,
10,01/11/2022,02:37:12,47.15,Off,AM,,,

Re-imported, it then looks more like something a person produced.
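The editor step above can also be scripted. The following is a sketch under one assumption: any line that starts with a comma (such as `,AM,,,`) is a continuation of the previous row. The function name `merge_continuation_rows` is made up for illustration.

```python
def merge_continuation_rows(lines):
    """Append rows that start with ',' to the row before them."""
    merged = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # drop blank lines
        if line.startswith(",") and merged:
            merged[-1] += line  # e.g. '...,Off' + ',AM,,,'
        else:
            merged.append(line)
    return merged


sample = [
    "Sr No,DateT,Ime,Voltage (v),Ignition",
    "1,01/11/2022,00:08:10,47.15,Off",
    ",AM,,,",
    "2,01/11/2022,00:23:10,47.15,Off",
    ",AM,,,",
]
for row in merge_continuation_rows(sample):
    print(row)
```

This reproduces the joined form shown above, trailing `,AM,,,` included. Since the times are already in 24-hour format, you could just as well skip those rows instead of appending them.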

In the discussion it was confirmed that all that is needed is a way to get a structured list, so first use:
pdftotext -layout -nopgbrk -x 0 -y 60 -W 800 -H 800 -fixed 6 "Battery Voltage.pdf" &type "battery voltage.txt"|findstr "O">battery.txt

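The output gives regular columns of data that you can use as a framework: fix the headers, split the fields, or otherwise work with the clean data.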

                 1            01-11-2022 00:08:10         47.15                 Off
                 2            01-11-2022 00:23:10         47.15                 Off
                 3            01-11-2022 00:38:10         47.15                 Off
                 4            01-11-2022 00:58:10         47.15                 Off
                 5            01-11-2022 01:18:10         47.15                 Off
...
               32357          24-11-2022 17:48:43         45.40                  On
               32358          24-11-2022 17:48:52         44.51                  On
               32359          24-11-2022 17:48:55         44.51                  On
               32360          24-11-2022 17:48:58         44.51                  On
               32361          24-11-2022 17:48:58         44.51                  On

At this stage we can use text processing to turn it into CSV, or add JSON brackets.

for /f "tokens=1,2,3,4,5 delims= " %%a In ('Findstr /C:"O" battery.txt') do echo csv is "%%a,%%b,%%c,%%d,%%e">output.txt
...
csv is "32357,24-11-2022,17:48:43,45.40,On"
csv is "32358,24-11-2022,17:48:52,44.51,On"
csv is "32359,24-11-2022,17:48:55,44.51,On"
csv is "32360,24-11-2022,17:48:58,44.51,On"
csv is "32361,24-11-2022,17:48:58,44.51,On"
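For comparison, here is roughly the same tokenising step sketched in Python. It assumes every data line splits into exactly five whitespace-separated fields with a numeric serial number first, as in the listing above. The name `rows_from_layout` is hypothetical.

```python
import csv
import io


def rows_from_layout(lines):
    """Yield [serial, date, time, voltage, ignition] for each data line."""
    for line in lines:
        fields = line.split()
        # Assumption: data rows have 5 fields and start with a number;
        # anything else (titles, headers, stray AM/PM) is skipped.
        if len(fields) == 5 and fields[0].isdigit():
            yield fields


sample = [
    "                 1            01-11-2022 00:08:10         47.15                 Off",
    "               32361          24-11-2022 17:48:58         44.51                  On",
]
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows_from_layout(sample))
print(buffer.getvalue(), end="")
```

Swapping the `io.StringIO` buffer for a file opened with `newline=""` would write the rows to disk instead.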

So the request is for JSON (not my forte, so you may need to improve on my code, as I do not know what mongo expects).

Here, I drop the PDF onto battery.bat:

{"line_id":1,"created":{"date":"01-11-2022"},{"time":"00:08:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":2,"created":{"date":"01-11-2022"},{"time":"00:23:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":3,"created":{"date":"01-11-2022"},{"time":"00:38:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":4,"created":{"date":"01-11-2022"},{"time":"00:58:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":5,"created":{"date":"01-11-2022"},{"time":"01:18:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":6,"created":{"date":"01-11-2022"},{"time":"01:33:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":7,"created":{"date":"01-11-2022"},{"time":"01:48:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":8,"created":{"date":"01-11-2022"},{"time":"02:03:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":9,"created":{"date":"01-11-2022"},{"time":"02:18:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":10,"created":{"date":"01-11-2022"},{"time":"02:37:12"},{"Voltage":"47.15"},{"State","Off"}}
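Note that a strict JSON parser would reject the lines above, because `{"State","Off"}` uses a comma where a colon is needed and the objects after "created" have no keys. As a hedged alternative, this Python sketch emits one valid JSON object per line. The flat field names are an assumption, since what mongo expects is not known here, and `to_json_line` is a made-up helper.

```python
import json


def to_json_line(fields):
    """Turn [serial, date, time, voltage, state] into one JSON object."""
    line_id, date, time, voltage, state = fields
    record = {
        "line_id": int(line_id),
        "date": date,
        "time": time,
        "voltage": float(voltage),  # a number rather than a string
        "state": state,
    }
    return json.dumps(record)


print(to_json_line("1 01-11-2022 00:08:10 47.15 Off".split()))
```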

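It runs a bit slowly in a plain console, so let's run it blind by adding @. It still takes time because we are working in plain text, so expect a noticeable delay: 32,000+ lines took about 2 1/2 minutes on my rig.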

pdftotext -layout -nopgbrk -x 0 -y 60 -W 700 -H 800 -fixed 8 "%~1" battery.txt

echo Heading however you wish it for json perhaps just opener [ but note only one redirect chevron >"%~dpn1.txt"

for /f "tokens=1,2,3,4,5 delims= " %%a In ('Findstr /C:"O" battery.txt') do @echo  "%%a": { "Date": "%%b", "Time": "%%c", "Voltage": %%d, "Ignition": "%%e" },>>"%~dpn1.txt"
REM another json style could be  { "Line_Id": %%a, "Date": "%%b", "Time": "%%c", "Voltage": %%d, "Ignition": "%%e" },
REM another for an array can simply be [%%a,"%%b","%%c",%%d,"%%e" ],

echo Tailing however you wish it for json perhaps just final closer ] but note double chevron >>"%~dpn1.txt"

To see progress, change @echo { to @echo %%a&echo {

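So it finishes after about a minute; however, showing progress tends to add an extra minute for all the display activity. The window closing is the sign that it has finished.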

Answer 2

For cases like this, build a parser that converts the unusable data into something you can use.

The logic below converts that exact file to CSV, but it only works for that specific file's content.

Note that for this particular file you can ignore AM/PM, because the time is in 24-hour format.

import pdfplumber


file = "Battery Voltage.pdf"
skiplines = [
    "Battery Voltage",
    "AM",
    "PM",
    "Sr No DateTIme Voltage (v) Ignition",
    ""
]


with open("output.csv", "w") as outfile:
    header = "serialnumber;date;time;voltage;ignition\n"
    outfile.write(header)
    with pdfplumber.open(file) as pdf:
        for page in pdf.pages:
            for line in page.extract_text().split('\n'):
                if line.strip() in skiplines:
                    continue
                outfile.write(";".join(line.split())+"\n")
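Because the times are already in 24-hour format, the date and time fields can be parsed directly, without the AM/PM marker. This sketch assumes the day-first, dash-separated layout seen in this file (for example 24-11-2022); adjust the format string if your extraction uses slashes. `parse_timestamp` is a hypothetical helper.

```python
from datetime import datetime


def parse_timestamp(date_text, time_text):
    """Combine the date and time fields into one datetime value."""
    # Assumption: day-month-year with dashes, 24-hour clock (%H).
    return datetime.strptime(f"{date_text} {time_text}", "%d-%m-%Y %H:%M:%S")


stamp = parse_timestamp("24-11-2022", "17:48:43")
print(stamp.isoformat())  # prints 2022-11-24T17:48:43
```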

Edit

So, JSON files in Python are basically just lists of dictionary items (yes, this is an oversimplification).

The only thing you need to change is how each line is actually processed. The substance of the logic does not change...

import pdfplumber
import json


file = "Battery Voltage.pdf"
skiplines = [
    "Battery Voltage",
    "AM",
    "PM",
    "Sr No DateTIme Voltage (v) Ignition",
    ""
]
result = []


with pdfplumber.open(file) as pdf:
    for page in pdf.pages:
        for line in page.extract_text().split("\n"):
            if line.strip() in skiplines:
                continue
            serialnumber, date, time, voltage, ignition = line.split()
            result.append(
                {
                    "serialnumber": serialnumber,
                    "date": date,
                    "time": time,
                    "voltage": voltage,
                    "ignition": ignition,
                }
            )

with open("output.json", "w") as outfile:
    json.dump(result, outfile)


Answer 1
2 2022-11-26 13:08:52

Answer 2
1 2022-11-26 09:38:28
