Removing extra commas, spaces, and row shifts from a .csv file using Python

Question

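I have a 5-page PDF file, and each page contains a table that I need to extract. I need to extract all the tables from every page and save them as a DataFrame using Python, so I used tabula to convert the file to a CSV file: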

tabula.convert_into('input.pdf', "output.csv", output_format="csv", pages='all')

The main problem with output.csv is that it contains several extra commas.

Example:

Id,Name,Age,,Score,Rang,Bonus
181,ALEX,,,,20,987
182,Julia,,,,18,8.390
183,Marian,,,,21,9.170
184,Julien,,0,175,60,9.095
Id,Name,Age,,Score,Rang,Bonus
215,Asma,26,,35,19,3.807
216,Juan,,,,20,7.982
217,Rami,,,,10,1.832
Id,Name,Age,,Score,Rang,Bonus
415,Jessica,,4 920,8 873,538,7.994
416,Karen,,890,6,12,9.993
417,Andrea,,0,69,283,7.200
Id,Name,Age,,Score,Rang,Bonus
419,Rym,10,,18,,10,7.196
420,Noor,10,,70,,910,8.291
421,Nathalie,0,,5,,0,0.900
"",Id,Name,Age,,Score,Rang,Bonus
456,,Joe,,10,13,0,74.917
457,,Loula,,0,18,11,9.990
458,,Maria,,0,15,172,6.425
459,,Carl,,15,17,11,3.349
Id,Name,Age,,Score,Rang,Bonus
566,Diego,,,,0,3.680
567,Carla,0,,26,1,19.361

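When I convert the CSV file into rows and columns, some of the rows end up shifted.

See the image below for the problem. As you can see in the image, there are row shifts (each table on each page of the file has its own specific shift). How can I fix this?

Note: the DataFrame should have 6 columns, with empty fields where values are missing. I guess the extra commas come from blank spaces in the PDF file. How can I remove the extra commas from the CSV file, or remove the extra spaces in the PDF file?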
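One way to see the shifts without the image is to count how many fields each row has. The sketch below is only a diagnostic, not a fix; the `field_counts` helper is a hypothetical name, and the rows are a small subset copied from the example above (for the real file, `open('output.csv', newline='')` would take the place of the `StringIO`).

```python
import csv
from collections import Counter
from io import StringIO

# A few rows copied from the example above.
sample = StringIO(
    "Id,Name,Age,,Score,Rang,Bonus\n"
    "181,ALEX,,,,20,987\n"
    "419,Rym,10,,18,,10,7.196\n"
    "456,,Joe,,10,13,0,74.917\n"
)


def field_counts(csv_file):
    """Map 'number of fields in a row' to 'how many rows have that many'."""
    return Counter(len(row) for row in csv.reader(csv_file))


counts = field_counts(sample)
print(counts)  # no row has the expected 6 fields: two have 7, two have 8
```

Rows with 7 fields carry one extra comma and rows with 8 carry two, which is why each page appears shifted by a different amount.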

The expected output is shown in the image below:

I would sincerely appreciate your help.

Answer 1

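I find this easier to understand than Martin Evans's answer.

It is a generator that yields rows with the same length as the cleaned-up first row. For each row, it removes the first empty string until the row has the correct length.

Just like Martin's answer, it produces your expected DataFrame from your sample data.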

import pandas as pd
from io import StringIO
import csv

f = StringIO("""Id,Name,Age,,Score,Rang,Bonus
181,ALEX,,,,20,987
182,Julia,,,,18,8.390
183,Marian,,,,21,9.170
184,Julien,,0,175,60,9.095
Id,Name,Age,,Score,Rang,Bonus
215,Asma,26,,35,19,3.807
216,Juan,,,,20,7.982
217,Rami,,,,10,1.832
Id,Name,Age,,Score,Rang,Bonus
415,Jessica,,4 920,8 873,538,7.994
416,Karen,,890,6,12,9.993
417,Andrea,,0,69,283,7.200
Id,Name,Age,,Score,Rang,Bonus
419,Rym,10,,18,,10,7.196
420,Noor,10,,70,,910,8.291
421,Nathalie,0,,5,,0,0.900
"",Id,Name,Age,,Score,Rang,Bonus
456,,Joe,,10,13,0,74.917
457,,Loula,,0,18,11,9.990
458,,Maria,,0,15,172,6.425
459,,Carl,,15,17,11,3.349
Id,Name,Age,,Score,Rang,Bonus
566,Diego,,,,0,3.680
567,Carla,0,,26,1,19.361""")


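def clean_up(csv_file):
    header = None
    for line in csv_file:
        if header is None:
            # the first row, minus its empty names, becomes the header
            header = [v for v in line if v]
            length = len(header)
            continue
        # drop the first empty string until the row has the header's length
        while len(line) > length:
            line.remove('')
        # repeated header rows (one per page) are skipped
        if line != header:
            yield dict(zip(header, line))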

df = pd.DataFrame(clean_up(csv.reader(f)))
print(df)

This gives you:

     Id      Name    Age  Score Rang   Bonus
0   181      ALEX                 20     987
1   182     Julia                 18   8.390
2   183    Marian                 21   9.170
3   184    Julien      0    175   60   9.095
4   215      Asma     26     35   19   3.807
5   216      Juan                 20   7.982
6   217      Rami                 10   1.832
7   415   Jessica  4 920  8 873  538   7.994
8   416     Karen    890      6   12   9.993
9   417    Andrea      0     69  283   7.200
10  419       Rym     10     18   10   7.196
11  420      Noor     10     70  910   8.291
12  421  Nathalie      0      5    0   0.900
13  456       Joe     10     13    0  74.917
14  457     Loula      0     18   11   9.990
15  458     Maria      0     15  172   6.425
16  459      Carl     15     17   11   3.349
17  566     Diego                  0   3.680
18  567     Carla      0     26    1  19.361
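One caveat with this approach: `list.remove('')` raises a bare `ValueError` when a row is too long but has no empty string left to drop. That never happens with the sample data, but a guarded variant might look like the sketch below. The name `clean_up_checked` and the error message are my own assumptions, not part of the answer.

```python
import csv
from io import StringIO


def clean_up_checked(rows):
    """Same idea as clean_up, but with a descriptive error for rows that
    are too long and contain no empty field that could be dropped."""
    header = None
    for number, row in enumerate(rows, start=1):
        if header is None:
            header = [v for v in row if v]
            continue
        row = list(row)
        while len(row) > len(header):
            if '' not in row:
                raise ValueError(f"row {number} has too many non-empty fields: {row}")
            row.remove('')  # drops the first empty field
        if row != header:  # skip repeated header rows
            yield dict(zip(header, row))


sample = StringIO('Id,Name,Age,,Score,Rang,Bonus\n'
                  '181,ALEX,,,,20,987\n'
                  '"",Id,Name,Age,,Score,Rang,Bonus\n'
                  '456,,Joe,,10,13,0,74.917\n')
records = list(clean_up_checked(csv.reader(sample)))
print(records)
```

The repeated header row collapses back to the header and is skipped, so only the two data rows are yielded.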

Answer 2

My strategy is based on a short regular expression that captures the first 2 columns and the trailing numbers.

(\d+,[^,]+,) → numbers + comma + anything but comma + comma
,*           → zero or more commas
(\d.+)       → the rest of the line starting from the first number

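I then join the two groups, inserting enough commas in between so that the total is 5 (= 6 columns).

This seems like a very simple approach to me. As long as the numeric data is right-aligned, it works for any variant of the input with randomly inserted spaces and commas.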
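To make the pattern and the comma arithmetic concrete, here is a small trace of three rows from the question. The `trace` helper and its `n_columns` parameter are hypothetical names used only for illustration; the regular expression is the one described above.

```python
import re

PATTERN = re.compile(r'(\d+,[^,]+,),*(\d.+)')


def trace(line, n_columns=6):
    """Return (head, tail, padding) for one raw row, or None for a header row."""
    # Pre-cleaning: drop spaces, then collapse runs of commas.
    line = re.sub(',,+', ',', line.replace(' ', ''))
    match = PATTERN.match(line)
    if match is None:
        return None
    head, tail = match.groups()
    # A row of n_columns fields needs n_columns - 1 commas in total.
    padding = (n_columns - 1) - head.count(',') - tail.count(',')
    return head, tail, padding


print(trace('181,ALEX,,,,20,987'))                  # → ('181,ALEX,', '20,987', 2)
print(trace('415,Jessica,,4 920,8 873,538,7.994'))  # → ('415,Jessica,', '4920,8873,538,7.994', 0)
print(trace('Id,Name,Age,,Score,Rang,Bonus'))       # → None
```

The first row gets two commas of padding, so its two numbers land in the last two columns; the second row already has four numbers and needs none; the header does not start with a digit and is dropped.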

import io
import re

import pandas as pd

def fix_line(line):
    # remove spaces, then collapse runs of commas into a single comma
    line = re.sub(',,+', ',', line.replace(' ', ''))
    # groups: first two columns / leftover commas (not captured) / numbers
    match = re.match(r'(\d+,[^,]+,),*(\d.+)', line)
    if not match:  # header rows do not start with a digit, so they are dropped
        return ''
    head, tail = match.groups()
    # right-align the numbers: 6 columns = 5 commas, 2 of which are already in head
    return head + ',' * (5 - 2 - tail.count(',')) + tail


data_corr = [fix_line(line) for line in lines]

df = pd.read_csv(io.StringIO('\n'.join(data_corr)),
                 names=re.sub(',,+', ',', lines[0]).split(',')  # assign column names
                 )

This assumes the following input has been assigned to the variable lines before the code above runs:

['Id,Name,Age,,Score,Rang,Bonus',
 '181,ALEX,,,,20,987',
 '182,Julia,,,,18,8.390',
 '183,Marian,,,,21,9.170',
 '184,Julien,,0,175,60,9.095',
 'Id,Name,Age,,Score,Rang,Bonus',
 '215,Asma,26,,35,19,3.807',
 '216,Juan,,,,20,7.982',
 '217,Rami,,,,10,1.832',
 'Id,Name,Age,,Score,Rang,Bonus',
 '415,Jessica,,4 920,8 873,538,7.994',
 '416,Karen,,890,6,12,9.993',
 '417,Andrea,,0,69,283,7.200',
 'Id,Name,Age,,Score,Rang,Bonus',
 '419,Rym,10,,18,,10,7.196',
 '420,Noor,10,,70,,910,8.291',
 '421,Nathalie,0,,5,,0,0.900',
 '"",Id,Name,Age,,Score,Rang,Bonus',
 '456,,Joe,,10,13,0,74.917',
 '457,,Loula,,0,18,11,9.990',
 '458,,Maria,,0,15,172,6.425',
 '459,,Carl,,15,17,11,3.349',
 'Id,Name,Age,,Score,Rang,Bonus',
 '566,Diego,,,,0,3.680',
 '567,Carla,0,,26,1,19.361']

Output:

     Id      Name     Age   Score  Rang    Bonus
0   181      ALEX     NaN     NaN    20  987.000
1   182     Julia     NaN     NaN    18    8.390
2   183    Marian     NaN     NaN    21    9.170
3   184    Julien     0.0   175.0    60    9.095
4   215      Asma    26.0    35.0    19    3.807
5   216      Juan     NaN     NaN    20    7.982
6   217      Rami     NaN     NaN    10    1.832
7   415   Jessica  4920.0  8873.0   538    7.994
8   416     Karen   890.0     6.0    12    9.993
9   417    Andrea     0.0    69.0   283    7.200
10  419       Rym    10.0    18.0    10    7.196
11  420      Noor    10.0    70.0   910    8.291
12  421  Nathalie     0.0     5.0     0    0.900
13  456       Joe    10.0    13.0     0   74.917
14  457     Loula     0.0    18.0    11    9.990
15  458     Maria     0.0    15.0   172    6.425
16  459      Carl    15.0    17.0    11    3.349
17  566     Diego     NaN     NaN     0    3.680
18  567     Carla     0.0    26.0     1   19.361

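Note: if the input is a file, first read the lines, without their trailing newlines (otherwise the last column name would end in a newline), using:

with open('/path/to/file', 'r') as f:
    lines = f.read().splitlines()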

Answer 3

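Load the CSV content into a DataFrame and drop the third column, and you will get the data in the desired format.

Note: I have not added any column names here. You can add them later, after dropping the column.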

import pandas as pd

l = ['181,ALEX,,,,20,987', '182,Julia,,18,79,98,8.390', '183,Marian,,21,89,70,9.170', '184,Julien,,,,60,9.095']

df = pd.DataFrame([sub.split(",") for sub in l])

df.drop(2, inplace=True, axis=1)
print(df)

Output:

     0       1   3   4   5      6
0  181    ALEX          20    987
1  182   Julia  18  79  98  8.390
2  183  Marian  21  89  70  9.170
3  184  Julien          60  9.095
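Dropping the third column is only safe when every row has seven fields and the third one is empty. With the full sample from the question that is not always the case, so a quick check beforehand may be worthwhile. The sketch below uses three rows copied from the question; `safe_to_drop_third` is a hypothetical helper name.

```python
rows = [line.split(',') for line in [
    '181,ALEX,,,,20,987',
    '215,Asma,26,,35,19,3.807',
    '419,Rym,10,,18,,10,7.196',
]]


def safe_to_drop_third(row, expected_fields=7):
    """True when the row has exactly expected_fields fields and the third is empty."""
    return len(row) == expected_fields and row[2] == ''


unsafe = [row for row in rows if not safe_to_drop_third(row)]
print(unsafe)  # the 215 row holds an Age in column 3; the 419 row has 8 fields
```

For rows flagged here, dropping column 2 would either delete a real Age value or still leave the row one field too long.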

Answer 4

The following approach might work, but it is not ideal:

import pandas as pd
import csv

data = []

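with open('output.csv', newline='') as f_input:
    csv_input = csv.reader(f_input)
    header = [v for v in next(csv_input) if v]      # Remove empty column names

    for row in csv_input:
        empty = row.index('')                         # Position of the first empty entry
        row = [v.replace(' ', '') for v in row if v]  # Drop empty entries and inner spaces

        if row[0] != 'Id':                            # Skip repeated header rows
            # Pad back to 6 values, starting where the first empty entry was
            row = row[:empty] + ['' for _ in range(6 - len(row))] + row[empty:]
            data.append(row)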
        
df = pd.DataFrame(data, columns=header)
print(df)

This gives you:

     Id      Name   Age Score Rang   Bonus
0   181      ALEX               20     987
1   182     Julia               18   8.390
2   183    Marian               21   9.170
3   184    Julien     0   175   60   9.095
4   215      Asma    26    35   19   3.807
5   216      Juan               20   7.982
6   217      Rami               10   1.832
7   415   Jessica  4920  8873  538   7.994
8   416     Karen   890     6   12   9.993
9   417    Andrea     0    69  283   7.200
10  419       Rym    10    18   10   7.196
11  420      Noor    10    70  910   8.291
12  421  Nathalie     0     5    0   0.900
13  456       Joe    10    13    0  74.917
14  457     Loula     0    18   11   9.990
15  458     Maria     0    15  172   6.425
16  459      Carl    15    17   11   3.349
17  566     Diego                0   3.680
18  567     Carla     0    26    1  19.361

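It works by removing all the empty entries and then padding the row back to 6 values, inserting the blanks at the position of the first empty entry. Since the Age column appears to be optional, it might not be 100% reliable.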
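The padding step can be isolated into a small function to see what happens to individual rows. This is a sketch of the same logic, not the answer's code: `pad_row` and its `width` parameter are hypothetical names, and unlike the original it falls back to the end of the row when no entry is empty, instead of raising `ValueError` from `row.index('')`.

```python
def pad_row(row, width=6):
    """Drop empty entries, then re-insert blanks where the first empty entry was."""
    # Assumption: with no empty entry at all, any padding goes at the end.
    empty = row.index('') if '' in row else len(row)
    values = [v.replace(' ', '') for v in row if v]
    return values[:empty] + [''] * (width - len(values)) + values[empty:]


print(pad_row('181,ALEX,,,,20,987'.split(',')))
# → ['181', 'ALEX', '', '', '20', '987']
print(pad_row('456,,Joe,,10,13,0,74.917'.split(',')))
# → ['456', 'Joe', '10', '13', '0', '74.917']
```

In the first row the first empty entry is at index 2, so the two blanks are inserted there and the numbers stay right-aligned. The second row already has six non-empty values, so nothing is inserted.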

4 solutions

Solution 1
1 2021-07-19 11:43:44

Solution 2
1 accepted 2021-07-19 13:11:40

Solution 3
0 2021-07-16 14:56:27

Solution 4
0 2021-07-16 15:08:10
