[英]How to Convert PDF file into CSV file using Python Pandas
我有一個 PDF 文件,我需要將它轉換成一個 CSV 文件這是我的 pdf 文件示例作為鏈接https://online.flippingbook.com/view/352975479/使用的代碼是
import re
import parse
import pdfplumber
import pandas as pd
from collections import namedtuple
file = "Battery Voltage.pdf"
lines = []
total_check = 0
with pdfplumber.open(file) as pdf:
pages = pdf.pages
for page in pdf.pages:
text = page.extract_text()
for line in text.split('\n'):
print(line)
使用上面的腳本我沒有得到正確的 output,對於時間列“AM”正在進入下一行。 我拿到的output是這樣的
它可以幫助您了解 pdf 的表面如何顯示到屏幕上。 這樣一串純文本就可以一部分一部分地顯示在顯示器上。 (這里我強調了第一個 AM 的放置位置。
作為一個附帶問題,我認為文件中的第一個 AM 乍一看編碼為這個塊
BT
/F1 12 Tf
1 0 0 1 224.20265 754.6322 Tm
[<001D001E>] TJ
ET
在該區域中 1D = A 和 1E = M
因此,如果您希望在顯示時提取每一行,到目前為止,最簡單的方法是使用 pdftotext 之類的庫,它專門輸出頁面上看到的每一行文本。
因此,使用表格逗號分隔等攻擊,您可以預期每個AM
都會有自己的行。 按邏輯應該是" ",AM," "," "
但是一些提取器應該說nan,AM,nan,nan
作為文本,它看起來像這樣來自一個可編程的行
pdftotext -layout "Battery Voltage.pdf"
這將 output "Battery Voltage.txt" 在同一個工作文件夾中
現在我們可以通過幾次點擊(不再)將 csv 連同 csv 包含的所有奇怪內容導出為“正確的輸出”。
,,Battery Vo,ltage,
Sr No,DateT,Ime,Voltage (v),Ignition
1,01/11/2022,00:08:10,47.15,Off
,AM,,,
2,01/11/2022,00:23:10,47.15,Off
,AM,,,
3,01/11/2022,00:38:10,47.15,Off
,AM,,,
4,01/11/2022,00:58:10,47.15,Off
,AM,,,
5,01/11/2022,01:18:10,47.15,Off
,AM,,,
6,01/11/2022,01:33:10,47.15,Off
,AM,,,
7,01/11/2022,01:48:10,47.15,Off
,AM,,,
8,01/11/2022,02:03:10,47.15,Off
,AM,,,
9,01/11/2022,02:18:10,47.15,Off
,AM,,,
10,01/11/2022,02:37:12,47.15,Off
,AM,,,
所以,如果在 csv 生成之前沒有完成編輯,那么在編輯器中發布過程會更簡單,就像這個 html 頁面(不需要更多應用程序)
,,Battery,Voltage,
Sr No,Date,Time,Voltage (v),Ignition
1,01/11/2022,00:08:10,47.15,Off,AM,,,
2,01/11/2022,00:23:10,47.15,Off,AM,,,
3,01/11/2022,00:38:10,47.15,Off,AM,,,
4,01/11/2022,00:58:10,47.15,Off,AM,,,
5,01/11/2022,01:18:10,47.15,Off,AM,,,
6,01/11/2022,01:33:10,47.15,Off,AM,,,
7,01/11/2022,01:48:10,47.15,Off,AM,,,
8,01/11/2022,02:03:10,47.15,Off,AM,,,
9,01/11/2022,02:18:10,47.15,Off,AM,,,
10,01/11/2022,02:37:12,47.15,Off,AM,,,
然后重新導入它看起來更像是人為生成的
在討論中,確認所需的只是一種結構化列表的方法,並首先使用pdftotext -layout -nopgbrk -x 0 -y 60 -W 800 -H 800 -fixed 6 "Battery Voltage.pdf" &type "battery voltage.txt"|findstr "O">battery.txt
將 output 規定的數據列用於框架,固定標題或拆分或以其他方式使用清潔數據。
1 01-11-2022 00:08:10 47.15 Off
2 01-11-2022 00:23:10 47.15 Off
3 01-11-2022 00:38:10 47.15 Off
4 01-11-2022 00:58:10 47.15 Off
5 01-11-2022 01:18:10 47.15 Off
...
32357 24-11-2022 17:48:43 45.40 On
32358 24-11-2022 17:48:52 44.51 On
32359 24-11-2022 17:48:55 44.51 On
32360 24-11-2022 17:48:58 44.51 On
32361 24-11-2022 17:48:58 44.51 On
這個階段我們可以使用文本處理如csv或者加上json括號
for /f "tokens=1,2,3,4,5 delims= " %%a In ('Findstr /C:"O" battery.txt') do echo csv is "%%a,%%b,%%c,%%d,%%e">output.txt
...
csv is "32357,24-11-2022,17:48:43,45.40,On"
csv is "32358,24-11-2022,17:48:52,44.51,On"
csv is "32359,24-11-2022,17:48:55,44.51,On"
csv is "32360,24-11-2022,17:48:58,44.51,On"
csv is "32361,24-11-2022,17:48:58,44.51,On"
所以請求是 JSON(不是我的強項,所以你可能需要改進我的代碼,因為我不知道 mongo 期望什么)
在這里,我將 pdf 放到 battery.bat 上
{"line_id":1,"created":{"date":"01-11-2022"},{"time":"00:08:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":2,"created":{"date":"01-11-2022"},{"time":"00:23:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":3,"created":{"date":"01-11-2022"},{"time":"00:38:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":4,"created":{"date":"01-11-2022"},{"time":"00:58:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":5,"created":{"date":"01-11-2022"},{"time":"01:18:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":6,"created":{"date":"01-11-2022"},{"time":"01:33:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":7,"created":{"date":"01-11-2022"},{"time":"01:48:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":8,"created":{"date":"01-11-2022"},{"time":"02:03:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":9,"created":{"date":"01-11-2022"},{"time":"02:18:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":10,"created":{"date":"01-11-2022"},{"time":"02:37:12"},{"Voltage":"47.15"},{"State","Off"}}
它在純控制台中運行有點慢,所以讓我們通過添加@來盲目地運行它,它仍然需要時間,因為我們在純文本中工作,所以預計 32,000 多行 = 2+1/2 分鍾的顯着延遲我的裝備
pdftotext -layout -nopgbrk -x 0 -y 60 -W 700 -H 800 -fixed 8 "%~1" battery.txt
echo Heading however you wish it for json perhaps just opener [ but note only one redirect chevron >"%~dpn1.txt"
for /f "tokens=1,2,3,4,5 delims= " %%a In ('Findstr /C:"O" battery.txt') do @echo "%%a": { "Date": "%%b", "Time": "%%c", "Voltage": %%d, "Ignition": "%%e" },>>"%~dpn1.txt"
REM another json style could be { "Line_Id": %%a, "Date": "%%b", "Time": "%%c", "Voltage": %%d, "Ignition": "%%e" },
REM another for an array can simply be [%%a,"%%b","%%c",%%d,"%%e" ],
echo Tailing however you wish it for json perhaps just final closer ] but note double chevron >>"%~dpn1.txt"
要查看進度, @echo {
更改為@echo %%a&echo {
對於這些情況,構建一個解析器,將不可用的數據轉換為您可以使用的數據。
下面的邏輯將該確切文件轉換為 CSV,但僅適用於該特定文件內容。
請注意,對於此特定文件,您可以忽略 AM/PM,因為時間采用 24 小時格式。
import pdfplumber
file = "Battery Voltage.pdf"
skiplines = [
"Battery Voltage",
"AM",
"PM",
"Sr No DateTIme Voltage (v) Ignition",
""
]
with open("output.csv", "w") as outfile:
header = "serialnumber;date;time;voltage;ignition\n"
outfile.write(header)
with pdfplumber.open(file) as pdf:
for page in pdf.pages:
for line in page.extract_text().split('\n'):
if line.strip() in skiplines:
continue
outfile.write(";".join(line.split())+"\n")
編輯
因此,python 中的 JSON 個文件基本上只是字典項的列表(是的,這過於簡單化了)。
您唯一需要更改的是實際處理線條的方式。 邏輯的實際內容沒有改變......
import pdfplumber
import json
file = "Battery Voltage.pdf"
skiplines = [
"Battery Voltage",
"AM",
"PM",
"Sr No DateTIme Voltage (v) Ignition",
""
]
result = []
with pdfplumber.open(file) as pdf:
for page in pdf.pages:
for line in page.extract_text().split("\n"):
if line.strip() in skiplines:
continue
serialnumber, date, time, voltage, ignition = line.split()
result.append(
{
"serialnumber": serialnumber,
"date": date,
"time": time,
"voltage": voltage,
"ignition": ignition,
}
)
with open("output.json", "w") as outfile:
json.dump(result, outfile)
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.