
Pandas CSV file with occasional extra columns in the middle

I'm processing lots (thousands) of ~100k-line CSV files that are produced by someone else. 9 times out of 10 the files have 8 columns and all is right with the world. The 10th time or so, ~10 lines will have 2 extra columns inserted after column 6. (For simplicity, let's assume all the rows have the same values.)

A,B,C,D,E,F,G,H
A,B,C,D,E,F,G,H
A,B,C,D,E,F,Foo,Bar,G,H
A,B,C,D,E,F,G,H
A,B,C,D,E,F,Foo,Bar,G,H
A,B,C,D,E,F,G,H
A,B,C,D,E,F,G,H

I don't have control over the generation of the data files and need to clean them on my end, but I believe that rows with extra columns have corrupted data, so I just want to reject them for now. I figured a simple way to handle this would be to initially load my data into a 10-column DataFrame:

In [100]: data_df = pd.read_csv(data_dir + data_file, names=ColumnNames)

In [101]: data_df
Out[101]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 99531 entries, 0 to 99530
Data columns:
time             99531  non-null values
var1             99531  non-null values
var2             99531  non-null values
var3             99531  non-null values
var4             99531  non-null values
var5             99531  non-null values
var6             98386  non-null values
var7             29829  non-null values
extra1           10  non-null values
extra2           10  non-null values
dtypes: float64(3), int64(5), object(2)

Then I keep only the rows where both extra1 and extra2 are null, and drop the two extra columns:

data_df = data_df[pd.isnull(data_df['extra1']) & pd.isnull(data_df['extra2'])]
del data_df['extra1']
del data_df['extra2']
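For reference, here is a minimal self-contained sketch of the same approach run on the sample rows above. The column names are assumed from the DataFrame summary, and `io.StringIO` stands in for the real file; `clean_df` and `is_clean` are names made up for this illustration.

```python
import io

import pandas as pd

# Sample rows from the question; two of them carry the extra Foo,Bar fields.
raw = """A,B,C,D,E,F,G,H
A,B,C,D,E,F,G,H
A,B,C,D,E,F,Foo,Bar,G,H
A,B,C,D,E,F,G,H
A,B,C,D,E,F,Foo,Bar,G,H
A,B,C,D,E,F,G,H
A,B,C,D,E,F,G,H
"""

# Assumed names: the eight real columns plus two overflow columns.
column_names = ["time", "var1", "var2", "var3", "var4",
                "var5", "var6", "var7", "extra1", "extra2"]

# Eight-field rows are padded with NaN in the two overflow columns.
data_df = pd.read_csv(io.StringIO(raw), names=column_names)

# Keep rows whose overflow columns are both empty, then drop those columns.
is_clean = data_df["extra1"].isna() & data_df["extra2"].isna()
clean_df = data_df.loc[is_clean].drop(columns=["extra1", "extra2"])

print(clean_df.shape)
```

Building the mask once and using `drop(columns=...)` avoids mutating `data_df` in place, but it is the same filter as the three lines above.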

This seems a little roundabout and non-ideal. Does anyone have a better idea of how to clean this?

Thanks

If you want to drop the bad lines, you might be able to use error_bad_lines=False (and warn_bad_lines=False if you want it to be quiet about it):

>>> !cat unclean.csv
A,B,C,D,E,F,G,H
A,B,C,D,E,F,G,H
A,B,C,D,E,F,Foo,Bar,G,H
A,B,C,D,E,F,G,H
A,B,C,D,E,F,Foo,Bar,G,H
A,B,C,D,E,F,G,H
A,B,C,D,E,F,G,H
>>> df = pd.read_csv("unclean.csv", error_bad_lines=False, header=None)
Skipping line 3: expected 8 fields, saw 10
Skipping line 5: expected 8 fields, saw 10

>>> df
   0  1  2  3  4  5  6  7
0  A  B  C  D  E  F  G  H
1  A  B  C  D  E  F  G  H
2  A  B  C  D  E  F  G  H
3  A  B  C  D  E  F  G  H
4  A  B  C  D  E  F  G  H
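
On newer pandas the same idea is spelled differently: as far as I know, `error_bad_lines` and `warn_bad_lines` were deprecated in 1.3 in favour of a single `on_bad_lines` parameter and removed in 2.0. A minimal sketch, assuming pandas 1.3 or later and using `io.StringIO` in place of unclean.csv:

```python
import io

import pandas as pd

# Same contents as unclean.csv above.
raw = """A,B,C,D,E,F,G,H
A,B,C,D,E,F,G,H
A,B,C,D,E,F,Foo,Bar,G,H
A,B,C,D,E,F,G,H
A,B,C,D,E,F,Foo,Bar,G,H
A,B,C,D,E,F,G,H
A,B,C,D,E,F,G,H
"""

# "skip" silently drops rows with too many fields;
# "warn" drops them too but reports each one.
df = pd.read_csv(io.StringIO(raw), header=None, on_bad_lines="skip")

print(df.shape)
```

Here the expected width of 8 comes from the first row, so the two 10-field rows are the ones that get skipped.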
