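Remove extra commas, spaces & line offsets in a .csv file using Python
I have a 5-page PDF file, and each page contains a table that I need to extract. I need to extract all the tables from every page and save them as a DataFrame using Python, so I used tabula to convert the file to a CSV file: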
tabula.convert_into('input.pdf', "output.csv", output_format="csv", pages='all')
The main problem with output.csv is that it contains several extra commas.
Example:
Id,Name,Age,,Score,Rang,Bonus
181,ALEX,,,,20,987
182,Julia,,,,18,8.390
183,Marian,,,,21,9.170
184,Julien,,0,175,60,9.095
Id,Name,Age,,Score,Rang,Bonus
215,Asma,26,,35,19,3.807
216,Juan,,,,20,7.982
217,Rami,,,,10,1.832
Id,Name,Age,,Score,Rang,Bonus
415,Jessica,,4 920,8 873,538,7.994
416,Karen,,890,6,12,9.993
417,Andrea,,0,69,283,7.200
Id,Name,Age,,Score,Rang,Bonus
419,Rym,10,,18,,10,7.196
420,Noor,10,,70,,910,8.291
421,Nathalie,0,,5,,0,0.900
"",Id,Name,Age,,Score,Rang,Bonus
456,,Joe,,10,13,0,74.917
457,,Loula,,0,18,11,9.990
458,,Maria,,0,15,172,6.425
459,,Carl,,15,17,11,3.349
Id,Name,Age,,Score,Rang,Bonus
566,Diego,,,,0,3.680
567,Carla,0,,26,1,19.361
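When I convert the CSV file to rows and columns, some of the rows are offset.
See the image below for the problem: as you can see in the image, there are some row offsets (each table on each page of the file has its own row offset). How can I fix this?
Note: the DataFrame should have 6 columns, and some fields are empty. I guess the extra commas come from blank spaces in the PDF file. How can I remove the extra commas from the CSV file, or remove the extra spaces in the PDF file?
I would sincerely appreciate your help.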
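I find this easier to understand than Martin Evans's answer.
It is a generator that yields rows with the same length as the cleaned-up first row. For each row, it removes the first empty string until the row has the right length.
Like Martin's answer, it produces your expected DataFrame from your sample data.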
import pandas as pd
from io import StringIO
import csv
f = StringIO("""Id,Name,Age,,Score,Rang,Bonus
181,ALEX,,,,20,987
182,Julia,,,,18,8.390
183,Marian,,,,21,9.170
184,Julien,,0,175,60,9.095
Id,Name,Age,,Score,Rang,Bonus
215,Asma,26,,35,19,3.807
216,Juan,,,,20,7.982
217,Rami,,,,10,1.832
Id,Name,Age,,Score,Rang,Bonus
415,Jessica,,4 920,8 873,538,7.994
416,Karen,,890,6,12,9.993
417,Andrea,,0,69,283,7.200
Id,Name,Age,,Score,Rang,Bonus
419,Rym,10,,18,,10,7.196
420,Noor,10,,70,,910,8.291
421,Nathalie,0,,5,,0,0.900
"",Id,Name,Age,,Score,Rang,Bonus
456,,Joe,,10,13,0,74.917
457,,Loula,,0,18,11,9.990
458,,Maria,,0,15,172,6.425
459,,Carl,,15,17,11,3.349
Id,Name,Age,,Score,Rang,Bonus
566,Diego,,,,0,3.680
567,Carla,0,,26,1,19.361""")
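def clean_up(csv_file):
    header = None
    for line in csv_file:
        if not header:
            # first row: keep only the non-empty column names
            header = [v for v in line if v]
            length = len(header)
            continue
        # drop the first empty field until the row is as long as the header
        while len(line) > length:
            line.remove('')
        # skip the repeated header rows
        if line != header:
            yield dict(zip(header, line))
df = pd.DataFrame(clean_up(csv.reader(f)))
print(df)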
This gives you:
     Id      Name    Age  Score  Rang   Bonus
0   181      ALEX                  20     987
1   182     Julia                  18   8.390
2   183    Marian                  21   9.170
3   184    Julien      0    175    60   9.095
4   215      Asma     26     35    19   3.807
5   216      Juan                  20   7.982
6   217      Rami                  10   1.832
7   415   Jessica  4 920  8 873   538   7.994
8   416     Karen    890      6    12   9.993
9   417    Andrea      0     69   283   7.200
10  419       Rym     10     18    10   7.196
11  420      Noor     10     70   910   8.291
12  421  Nathalie      0      5     0   0.900
13  456       Joe     10     13     0  74.917
14  457     Loula      0     18    11   9.990
15  458     Maria      0     15   172   6.425
16  459      Carl     15     17    11   3.349
17  566     Diego                   0   3.680
18  567     Carla      0     26     1  19.361
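To make the trimming step concrete, here is a minimal sketch that applies the same idea to single rows from the question. The helper name `trim_row` is my own and is not part of the answer. Unlike the generator above, it stops instead of raising `ValueError` when a row is too long but has no empty field left to drop; that guard is an assumption about what you would want, not something the sample data requires.

```python
import csv
from io import StringIO

def trim_row(row, length):
    """Drop the first empty field repeatedly until the row has `length` fields.

    Hypothetical helper for illustration; returns a new list.
    """
    row = list(row)
    while len(row) > length and '' in row:
        row.remove('')  # list.remove deletes the first matching element
    return row

# Three rows copied from the question, one per offset pattern.
sample = StringIO("184,Julien,,0,175,60,9.095\n"
                  "419,Rym,10,,18,,10,7.196\n"
                  "456,,Joe,,10,13,0,74.917\n")
rows = [trim_row(r, 6) for r in csv.reader(sample)]
for r in rows:
    print(r)
```

Each of the three rows ends up with exactly 6 fields, which is why `zip(header, line)` in the generator lines up with the column names.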
My strategy is based on a short regular expression that captures the first two columns and the trailing numbers.
(\d+,[^,]+,) → numbers + comma + anything but comma + comma
,* → zero or more commas
(\d.+) → the rest of the line starting from the first number
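I then concatenate the two groups, inserting enough commas in between so that the total is 5 (= 6 columns).
This seems like a very simple approach to me. It works for any input variant with randomly inserted spaces and commas, as long as the numeric data is right-aligned.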
import re
import io
import pandas as pd
def fix_line(line):
    # remove spaces and collapse duplicate commas
    line = re.sub(',,', ',', line.replace(' ', ''))
    # groups: first two columns / middle commas (not captured) / numbers
    match = re.match(r'(\d+,[^,]+,),*(\d.+)', line)
    if not match:  # removes the headers
        return ''
    # align numbers to the right: 6 columns = 5 commas
    return match.groups()[0] + (','*(5-2-match.groups()[1].count(','))) + match.groups()[1]
data_corr = [fix_line(line) for line in lines]
df = pd.read_csv(io.StringIO('\n'.join(data_corr)),
                 names=re.sub(',,+', ',', lines[0]).split(',')  # assign column names
                 )
This assumes the following input as the variable lines:
['Id,Name,Age,,Score,Rang,Bonus',
'181,ALEX,,,,20,987',
'182,Julia,,,,18,8.390',
'183,Marian,,,,21,9.170',
'184,Julien,,0,175,60,9.095',
'Id,Name,Age,,Score,Rang,Bonus',
'215,Asma,26,,35,19,3.807',
'216,Juan,,,,20,7.982',
'217,Rami,,,,10,1.832',
'Id,Name,Age,,Score,Rang,Bonus',
'415,Jessica,,4 920,8 873,538,7.994',
'416,Karen,,890,6,12,9.993',
'417,Andrea,,0,69,283,7.200',
'Id,Name,Age,,Score,Rang,Bonus',
'419,Rym,10,,18,,10,7.196',
'420,Noor,10,,70,,910,8.291',
'421,Nathalie,0,,5,,0,0.900',
'"",Id,Name,Age,,Score,Rang,Bonus',
'456,,Joe,,10,13,0,74.917',
'457,,Loula,,0,18,11,9.990',
'458,,Maria,,0,15,172,6.425',
'459,,Carl,,15,17,11,3.349',
'Id,Name,Age,,Score,Rang,Bonus',
'566,Diego,,,,0,3.680',
'567,Carla,0,,26,1,19.361']
Output:
     Id      Name     Age   Score  Rang    Bonus
0   181      ALEX     NaN     NaN    20  987.000
1   182     Julia     NaN     NaN    18    8.390
2   183    Marian     NaN     NaN    21    9.170
3   184    Julien     0.0   175.0    60    9.095
4   215      Asma    26.0    35.0    19    3.807
5   216      Juan     NaN     NaN    20    7.982
6   217      Rami     NaN     NaN    10    1.832
7   415   Jessica  4920.0  8873.0   538    7.994
8   416     Karen   890.0     6.0    12    9.993
9   417    Andrea     0.0    69.0   283    7.200
10  419       Rym    10.0    18.0    10    7.196
11  420      Noor    10.0    70.0   910    8.291
12  421  Nathalie     0.0     5.0     0    0.900
13  456       Joe    10.0    13.0     0   74.917
14  457     Loula     0.0    18.0    11    9.990
15  458     Maria     0.0    15.0   172    6.425
16  459      Carl    15.0    17.0    11    3.349
17  566     Diego     NaN     NaN     0    3.680
18  567     Carla     0.0    26.0     1   19.361
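NB: if the input is a file, first read the lines with:
with open('/path/to/file', 'r') as f:
    # splitlines() drops the trailing newlines, so lines[0] yields clean column names
    lines = f.read().splitlines()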
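As a quick check of the regex described in this answer, the sketch below applies the same two steps (collapse, then re-pad) to a few lines copied from the question. The names `PATTERN` and `pad_line` are made up for this illustration; the logic is meant to mirror `fix_line`, with the target column count exposed as an assumed `n_cols` parameter.

```python
import re

# Same pattern as in the answer: (id + name) / extra commas / numeric tail.
PATTERN = re.compile(r'(\d+,[^,]+,),*(\d.+)')

def pad_line(line, n_cols=6):
    """Hypothetical re-statement of fix_line for a single line."""
    # Remove spaces, then collapse each pair of commas into one.
    line = re.sub(',,', ',', line.replace(' ', ''))
    match = PATTERN.match(line)
    if not match:  # header lines do not start with a digit
        return ''
    head, tail = match.groups()
    # n_cols columns need n_cols - 1 commas in total.
    missing = (n_cols - 1) - head.count(',') - tail.count(',')
    return head + ',' * missing + tail

for raw in ['Id,Name,Age,,Score,Rang,Bonus',
            '181,ALEX,,,,20,987',
            '415,Jessica,,4 920,8 873,538,7.994',
            '456,,Joe,,10,13,0,74.917']:
    print(repr(pad_line(raw)))
```

Because the padding is always inserted right after the name, a row whose only missing value is Score rather than Age would be shifted; that is the "right-aligned" condition stated in the answer.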
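Load the CSV content into a DataFrame, drop the third column, and you will get the data in the desired format.
Note: I have not added any column names here. You can add them later, after dropping the column.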
import pandas as pd
l = ['181,ALEX,,,,20,987', '182,Julia,,18,79,98,8.390', '183,Marian,,21,89,70,9.170', '184,Julien,,,,60,9.095']
df = pd.DataFrame([sub.split(",") for sub in l])
df.drop(2, inplace=True, axis=1)
print(df)
Output:
     0       1   3   4   5      6
0  181    ALEX          20    987
1  182   Julia  18  79  98  8.390
2  183  Marian  21  89  70  9.170
3  184  Julien          60  9.095
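The note about adding column names afterwards could look like the sketch below. It reuses two of the rows from this answer; the column names are taken from the header in the question, and assigning them via `df.columns` is just one possible way to do it.

```python
import pandas as pd

rows = ['181,ALEX,,,,20,987', '182,Julia,,18,79,98,8.390']
df = pd.DataFrame([r.split(',') for r in rows])
df = df.drop(columns=2)  # drop the third column (label 2)
# Name the six remaining columns after the header from the question.
df.columns = ['Id', 'Name', 'Age', 'Score', 'Rang', 'Bonus']
print(df)
```

Note that this only works when every row has exactly one surplus field in the third position, which is true for these rows but not for every page in the question's sample.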
The following approach might work, but it is not ideal:
import pandas as pd
import csv
data = []
with open('output.csv') as f_input:
    csv_input = csv.reader(f_input)
    header = [v for v in next(csv_input) if v]   # Remove empty column names
    for row in csv_input:
        empty = row.index('')
        row = [v.replace(' ', '') for v in row if v]
        if row[0] != 'Id':
            row = row[:empty] + ['' for _ in range(6 - len(row))] + row[empty:]
            data.append(row)
df = pd.DataFrame(data, columns=header)
print(df)
Giving you:
     Id      Name   Age  Score  Rang   Bonus
0   181      ALEX                 20     987
1   182     Julia                 18   8.390
2   183    Marian                 21   9.170
3   184    Julien     0    175    60   9.095
4   215      Asma    26     35    19   3.807
5   216      Juan                 20   7.982
6   217      Rami                 10   1.832
7   415   Jessica  4920   8873   538   7.994
8   416     Karen   890      6    12   9.993
9   417    Andrea     0     69   283   7.200
10  419       Rym    10     18    10   7.196
11  420      Noor    10     70   910   8.291
12  421  Nathalie     0      5     0   0.900
13  456       Joe    10     13     0  74.917
14  457     Loula     0     18    11   9.990
15  458     Maria     0     15   172   6.425
16  459      Carl    15     17    11   3.349
17  566     Diego                  0   3.680
18  567     Carla     0     26     1  19.361
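It works by removing all blank entries and then padding the row back to 6 values at the position of the first blank entry. Since the Age column appears to be optional, it is probably not 100% reliable.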
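To illustrate both the padding step and the reliability caveat, here is a small sketch that applies the same logic to one already-split row. `pad_row` is a name I introduced for this example, and the second row is hypothetical: it does not appear in the question's data and is only there to show one way the approach can go wrong.

```python
def pad_row(row, n_cols=6):
    """Same steps as the loop above, applied to a single split row."""
    empty = row.index('')  # position of the first blank field
    values = [v.replace(' ', '') for v in row if v]
    return values[:empty] + [''] * (n_cols - len(values)) + values[empty:]

# Row from the question: the padding lands in the Age and Score slots.
good = pad_row(['181', 'ALEX', '', '', '', '20', '987'])
print(good)

# Hypothetical row (not in the question's data): a page with a leading
# offset and a missing Age. The first blank is at index 1, so the padding
# is inserted before the name instead of in the Age slot.
shifted = pad_row(['456', '', 'Joe', '', '', '13', '0', '74.917'])
print(shifted)
```

In the second case the name ends up in the Age column, so checking a few rows per page after conversion seems worthwhile.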