Read CSV into a DataFrame with varying row lengths using Pandas
I have a CSV that looks something like this:
1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454
...
When I try to build a DataFrame with the following code:
df = pd.read_csv('data.csv', header=0, engine='c', error_bad_lines=False)
it only adds the rows with 3 columns to the df (rows 1, 3 and 5 above).
The rest are treated as "bad lines" and give me the following error:
Skipping line 17467: expected 3 fields, saw 9
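How can I create a DataFrame that includes all the data in my CSV, perhaps just filling the empty cells with null? Or do I have to declare the maximum row length before adding to the df?
Thanks!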
If you want to use only pandas, read each line into a single column with pandas, then split on the delimiter.
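import pandas as pd
# Read every line into a single column (newer pandas releases may reject sep='\n')
df = pd.read_csv('data.csv', header=None, sep='\n')
# Split each line on " | "; shorter rows are padded with None
df = df[0].str.split(r'\s\|\s', expand=True)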
0 1 2 3 4 5 6
0 1 01-01-2019 724 None None None None
1 2 01-01-2019 233 436 None None None
2 3 01-01-2019 345 None None None None
3 4 01-01-2019 803 933 943 923 954
4 5 01-01-2019 454 None None None None
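If `sep='\n'` is not accepted by your pandas version, the same read-then-split idea can be sketched by reading the lines with plain Python first. This is only an illustration: the `raw` buffer stands in for the real file, and the names `raw`, `lines` and `split_df` are made up for the example.

```python
import io
import pandas as pd

# Hypothetical sample standing in for data.csv
raw = io.StringIO(
    "1 | 01-01-2019 | 724\n"
    "2 | 01-01-2019 | 233 | 436\n"
    "4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954\n"
)

# One Series element per non-empty line
lines = pd.Series([line.rstrip("\n") for line in raw if line.strip()])

# Split on the pipe and any surrounding whitespace;
# rows with fewer fields are padded with missing values
split_df = lines.str.split(r"\s*\|\s*", expand=True)
print(split_df)
```

For a real file, `open('data.csv')` would take the place of the `io.StringIO` buffer. All values stay as strings, so numeric columns would still need converting afterwards.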
If you know the data contains N columns, you can tell Pandas in advance how many columns to expect via the names parameter:
import pandas as pd
df = pd.read_csv('data', delimiter='|', names=list(range(7)))
print(df)
yields
0 1 2 3 4 5 6
0 1 01-01-2019 724 NaN NaN NaN NaN
1 2 01-01-2019 233 436.0 NaN NaN NaN
2 3 01-01-2019 345 NaN NaN NaN NaN
3 4 01-01-2019 803 933.0 943.0 923.0 954.0
4 5 01-01-2019 454 NaN NaN NaN NaN
If you have an upper bound N on the number of columns, you can have Pandas read N columns and then use dropna to remove the completely empty ones:
import pandas as pd
df = pd.read_csv('data', delimiter='|', names=list(range(20))).dropna(axis='columns', how='all')
print(df)
yields
0 1 2 3 4 5 6
0 1 01-01-2019 724 NaN NaN NaN NaN
1 2 01-01-2019 233 436.0 NaN NaN NaN
2 3 01-01-2019 345 NaN NaN NaN NaN
3 4 01-01-2019 803 933.0 943.0 923.0 954.0
4 5 01-01-2019 454 NaN NaN NaN NaN
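Note that this could also drop columns from the middle of the dataset (not just those on the right-hand side) if they are completely empty.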
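If no upper bound is known in advance, one possible approach is to scan the file once to find the widest row and then pass that width through names. The sketch below assumes the data fits a simple `|`-delimited layout without padding; `read_ragged`, `sample` and `ragged_df` are hypothetical names used only for illustration.

```python
import csv
import io
import pandas as pd

# Hypothetical sample standing in for the real file
sample = (
    "1|01-01-2019|724\n"
    "2|01-01-2019|233|436\n"
    "4|01-01-2019|803|933|943|923|954\n"
)

def read_ragged(buffer, delimiter="|"):
    """Read a delimited buffer whose rows have differing field counts."""
    # First pass: count the fields in the widest row
    # (max() raises ValueError if the buffer is empty)
    width = max(len(row) for row in csv.reader(buffer, delimiter=delimiter))
    # Second pass: rewind and let pandas pad short rows with NaN
    buffer.seek(0)
    return pd.read_csv(buffer, delimiter=delimiter, header=None,
                       names=list(range(width)))

ragged_df = read_ragged(io.StringIO(sample))
print(ragged_df)
```

This reads the file twice, which is usually acceptable for moderately sized files, and it avoids dropping genuinely empty columns from the middle of the data.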
Reading the data as fixed-width should work:
import pandas as pd
from io import StringIO
s = '''1 01-01-2019 724
2 01-01-2019 233 436
3 01-01-2019 345
4 01-01-2019 803 933 943 923 954
5 01-01-2019 454'''
pd.read_fwf(StringIO(s), header=None)
0 1 2 3 4 5 6
0 1 01-01-2019 724 NaN NaN NaN NaN
1 2 01-01-2019 233 436.0 NaN NaN NaN
2 3 01-01-2019 345 NaN NaN NaN NaN
3 4 01-01-2019 803 933.0 943.0 923.0 954.0
4 5 01-01-2019 454 NaN NaN NaN NaN
Or with the delimiter parameter:
s = '''1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454'''
pd.read_fwf(StringIO(s), header=None, delimiter='|')
0 1 2 3 4 5 6
0 1 01-01-2019 724 NaN NaN NaN NaN
1 2 01-01-2019 233 436.0 NaN NaN NaN
2 3 01-01-2019 345 NaN NaN NaN NaN
3 4 01-01-2019 803 933.0 943.0 923.0 954.0
4 5 01-01-2019 454 NaN NaN NaN NaN
Note that for your actual file you would not use StringIO; you would just replace it with your file path: pd.read_fwf('data.csv', delimiter='|', header=None)
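Add extra columns (empty or otherwise) to the first line of your csv file. Pandas takes the first row as the default size, and anything missing in the rows below it will be filled with NaN. Example:
file.csv: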
a,b,c,d,e
1,2,3
3
2,3,4
Code:
>>> import pandas as pd
>>> pd.read_csv('file.csv')
a b c d e
0 1 2.0 3.0 NaN NaN
1 3 NaN NaN NaN NaN
2 2 3.0 4.0 NaN NaN
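Consider using the Python csv module to do the work of importing the data and tidying up the format. You can implement a custom dialect to handle different flavors of csv.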
import csv
import pandas as pd
csv_data = """1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454"""
with open('test1.csv', 'w') as f:
    f.write(csv_data)
csv.register_dialect('PipeDialect', delimiter='|')
with open('test1.csv') as csvfile:
    data = [row for row in csv.reader(csvfile, 'PipeDialect')]
df = pd.DataFrame(data=data)
This gives you a csv import dialect and the following DataFrame:
0 1 2 3 4 5 6
0 1 01-01-2019 724 None None None None
1 2 01-01-2019 233 436 None None None
2 3 01-01-2019 345 None None None None
3 4 01-01-2019 803 933 943 923 954
4 5 01-01-2019 454 None None None None
What is left as an exercise is handling the whitespace padding in the input file.
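One way the whitespace padding could be handled is to strip every field while reading. This is a sketch rather than part of the answer above; `padded`, `rows` and `clean_df` are illustrative names, and the in-memory buffer stands in for the real file.

```python
import csv
import io
import pandas as pd

# Hypothetical padded input standing in for test1.csv
padded = io.StringIO(
    "1 | 01-01-2019 | 724\n"
    "2 | 01-01-2019 | 233 | 436\n"
)

# Strip the surrounding spaces from every field as the rows are read
rows = [[field.strip() for field in row]
        for row in csv.reader(padded, delimiter="|")]

# pandas pads the shorter rows with missing values
clean_df = pd.DataFrame(rows)

# Fields are still strings; convert the columns that should be numeric
clean_df[0] = pd.to_numeric(clean_df[0])
print(clean_df)
```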
import pandas as pd
# Declare more column names than the widest row is expected to need
colnames = [str(i) for i in range(9)]
df = pd.read_table('data.csv', header=None, sep=',', names=colnames)
If the code raises an error like the one below, change the 9 in the column names to the number x reported in the message:
Skipping line 17467: expected 3 fields, saw x