Read CSV into a dataFrame with varying row lengths using Pandas
So I have a CSV that looks a bit like this:
1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454
...
When I try to build a DataFrame with the following code:
df = pd.read_csv('data.csv', header=0, engine='c', error_bad_lines=False)
it only adds the rows with 3 columns to the df (rows 1, 3 and 5 above).
The rest are treated as "bad lines" and give me the following error:
Skipping line 17467: expected 3 fields, saw 9
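How do I create a DataFrame that contains all the data in my CSV, possibly just filling the empty cells with null? Or do I have to declare the maximum row length before adding to the df?
Thanks!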
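If you want to use only pandas, read each line into a single column, then split on the delimiter yourself.
import pandas as pd
# Read one line per row into a single column.
# Note: some newer pandas versions may reject sep='\n' with a ValueError.
df = pd.read_csv('data.csv', header=None, sep='\n')
# Raw string so the regex escapes are not interpreted as string escapes.
df = df[0].str.split(r'\s\|\s', expand=True)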
0 1 2 3 4 5 6
0 1 01-01-2019 724 None None None None
1 2 01-01-2019 233 436 None None None
2 3 01-01-2019 345 None None None None
3 4 01-01-2019 803 933 943 923 954
4 5 01-01-2019 454 None None None None
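A variant of the same idea that does not rely on read_csv accepting a newline as the separator. This is only a sketch: split_ragged_lines is a hypothetical helper name, the sample text stands in for the real file, and it assumes no field contains the pipe character itself.

```python
import io
import re

import pandas as pd


def split_ragged_lines(source, pattern=r"\s*\|\s*"):
    """Split each non-empty line on `pattern` and build a DataFrame.

    `source` is any iterable of text lines (an open file, a StringIO, a list).
    Shorter rows are padded with missing values by the DataFrame constructor.
    """
    rows = [re.split(pattern, line.strip()) for line in source if line.strip()]
    return pd.DataFrame(rows)


# Sample data standing in for data.csv (assumption for illustration).
sample = io.StringIO(
    "1 | 01-01-2019 | 724\n"
    "2 | 01-01-2019 | 233 | 436\n"
    "3 | 01-01-2019 | 345\n"
    "4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954\n"
    "5 | 01-01-2019 | 454\n"
)
df = split_ragged_lines(sample)
print(df)
```

For a real file, pass the open file object instead of the StringIO. All values come back as strings, so convert columns afterwards (for example with pd.to_numeric) if numbers are needed.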
If you know the data contains N columns, you can tell Pandas in advance how many columns to expect via the names parameter:
import pandas as pd
df = pd.read_csv('data', delimiter='|', names=list(range(7)))
print(df)
yields
0 1 2 3 4 5 6
0 1 01-01-2019 724 NaN NaN NaN NaN
1 2 01-01-2019 233 436.0 NaN NaN NaN
2 3 01-01-2019 345 NaN NaN NaN NaN
3 4 01-01-2019 803 933.0 943.0 923.0 954.0
4 5 01-01-2019 454 NaN NaN NaN NaN
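Because the sample file pads the pipes with spaces, the names approach can be combined with a regular-expression separator so the padding is consumed during parsing. The following is a sketch under the assumption that 7 is a known upper bound on the column count; regex separators require engine='python', and the in-memory sample stands in for the real file.

```python
import io

import pandas as pd

# Sample data standing in for the real file (assumption for illustration).
sample = io.StringIO(
    "1 | 01-01-2019 | 724\n"
    "2 | 01-01-2019 | 233 | 436\n"
    "3 | 01-01-2019 | 345\n"
    "4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954\n"
    "5 | 01-01-2019 | 454\n"
)

# The regex swallows optional whitespace on both sides of each pipe.
df = pd.read_csv(
    sample,
    sep=r"\s*\|\s*",
    engine="python",      # needed for regular-expression separators
    header=None,
    names=list(range(7)),  # assumed upper bound on the number of columns
)
print(df)
```

Rows shorter than the names list are filled with NaN, as in the example above; a row longer than the assumed bound would still be reported as a bad line.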
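If you have an upper bound N on the number of columns, you can have Pandas read N columns and then use dropna to drop the columns that are completely empty: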
import pandas as pd
df = pd.read_csv('data', delimiter='|', names=list(range(20))).dropna(axis='columns', how='all')
print(df)
yields
0 1 2 3 4 5 6
0 1 01-01-2019 724 NaN NaN NaN NaN
1 2 01-01-2019 233 436.0 NaN NaN NaN
2 3 01-01-2019 345 NaN NaN NaN NaN
3 4 01-01-2019 803 933.0 943.0 923.0 954.0
4 5 01-01-2019 454 NaN NaN NaN NaN
Note that this may drop columns from the middle of the dataset (not just columns on the right-hand side) if they are completely empty.
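If no upper bound is known, one option is to scan the file once to find the widest row and pass that count to names. This is a minimal sketch, not part of the answer above: max_field_count is a hypothetical helper, and counting delimiters assumes no field contains a quoted pipe.

```python
import io

import pandas as pd


def max_field_count(lines, delimiter="|"):
    """Return the largest number of delimiter-separated fields on any line."""
    return max(
        (line.count(delimiter) + 1 for line in lines if line.strip()),
        default=0,
    )


# Sample data standing in for the real file (assumption for illustration).
text = (
    "1 | 01-01-2019 | 724\n"
    "2 | 01-01-2019 | 233 | 436\n"
    "3 | 01-01-2019 | 345\n"
    "4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954\n"
    "5 | 01-01-2019 | 454\n"
)

# First pass: find the widest row. Second pass: parse with that many names.
n_cols = max_field_count(text.splitlines())
df = pd.read_csv(io.StringIO(text), delimiter="|", header=None,
                 names=list(range(n_cols)))
print(n_cols, df.shape)
```

With a file on disk, the first pass can iterate over the open file object directly, so the whole file never has to be held in memory at once.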
Reading it as fixed-width should work:
import pandas as pd
from io import StringIO
s = '''1 01-01-2019 724
2 01-01-2019 233 436
3 01-01-2019 345
4 01-01-2019 803 933 943 923 954
5 01-01-2019 454'''
pd.read_fwf(StringIO(s), header=None)
0 1 2 3 4 5 6
0 1 01-01-2019 724 NaN NaN NaN NaN
1 2 01-01-2019 233 436.0 NaN NaN NaN
2 3 01-01-2019 345 NaN NaN NaN NaN
3 4 01-01-2019 803 933.0 943.0 923.0 954.0
4 5 01-01-2019 454 NaN NaN NaN NaN
Or with the delimiter parameter:
s = '''1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454'''
pd.read_fwf(StringIO(s), header=None, delimiter='|')
0 1 2 3 4 5 6
0 1 01-01-2019 724 NaN NaN NaN NaN
1 2 01-01-2019 233 436.0 NaN NaN NaN
2 3 01-01-2019 345 NaN NaN NaN NaN
3 4 01-01-2019 803 933.0 943.0 923.0 954.0
4 5 01-01-2019 454 NaN NaN NaN NaN
Note that for your actual file you would not use StringIO; just replace it with your file path: pd.read_fwf('data.csv', delimiter='|', header=None)
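Add extra columns (empty or otherwise) to the top of the CSV file. Pandas takes the first row as the default size, and any shorter row below it is filled with NaN values. Example:
file.csv: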
a,b,c,d,e
1,2,3
3
2,3,4
Code:
>>> import pandas as pd
>>> pd.read_csv('file.csv')
a b c d e
0 1 2.0 3.0 NaN NaN
1 3 NaN NaN NaN NaN
2 2 3.0 4.0 NaN NaN
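Consider using Python's csv module to do the heavy lifting of importing the data and cleaning up the format. You can implement a custom dialect to handle different flavors of CSV.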
import csv
import pandas as pd
csv_data = """1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454"""
with open('test1.csv', 'w') as f:
    f.write(csv_data)
csv.register_dialect('PipeDialect', delimiter='|')
with open('test1.csv') as csvfile:
    data = [row for row in csv.reader(csvfile, 'PipeDialect')]
df = pd.DataFrame(data = data)
This gives you a csv import dialect and the following DataFrame:
0 1 2 3 4 5 6
0 1 01-01-2019 724 None None None None
1 2 01-01-2019 233 436 None None None
2 3 01-01-2019 345 None None None None
3 4 01-01-2019 803 933 943 923 954
4 5 01-01-2019 454 None None None None
The remaining exercise is to handle the whitespace padding in the input file.
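One way to approach that exercise is to strip each cell while collecting the rows. This is a sketch rather than the only option: it reads from an in-memory string instead of test1.csv and passes delimiter='|' directly to csv.reader instead of registering a dialect.

```python
import csv
import io

import pandas as pd

csv_data = """1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454"""

# Split on the pipe, then strip the surrounding padding from every cell.
reader = csv.reader(io.StringIO(csv_data), delimiter="|")
rows = [[cell.strip() for cell in row] for row in reader]

# Shorter rows are padded with missing values by the DataFrame constructor.
df = pd.DataFrame(rows)
print(df)
```

The values are still strings at this point; numeric columns can be converted afterwards if needed.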
colnames= [str(i) for i in range(9)]
df = pd.read_table('data.csv', header=None, sep=',', names=colnames)
If the code raises an error, change the 9 in the column names to the number x reported in the message:
Skipping line 17467: expected 3 fields, saw x