Reading key-value pairs into Pandas
Pandas makes it very easy to read CSV files:
pd.read_table('data.txt', sep=',')
Does Pandas have a similar feature for files that contain key-value pairs? I came up with this:
pd.DataFrame([dict([p.split('=') for p in l.split(',')]) for l in open('data.txt')])
If this is not built in, is there perhaps a more idiomatic way to do it?
The file of interest looks like this:
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525690751,price=1548.00,quantity=551
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525697183,price=1548.00,quantity=551
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525714498,price=1548.00,quantity=551
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525734967,price=1548.00,quantity=551
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525735567,price=1548.00,quantity=555
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525735585,price=1548.00,quantity=556
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525736116,price=1548.00,quantity=556
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525740757,price=1548.00,quantity=556
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525748502,price=1548.00,quantity=556
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525748952,price=1548.00,quantity=557
Every line has exactly the same keys, in the same order. There are no null values. The table to generate is:
exchange price quantity symbol timestamp
0 GLOBEX 1548.00 551\n ESM3 1365428525690751
1 GLOBEX 1548.00 551\n ESM3 1365428525697183
2 GLOBEX 1548.00 551\n ESM3 1365428525714498
3 GLOBEX 1548.00 551\n ESM3 1365428525734967
4 GLOBEX 1548.00 555\n ESM3 1365428525735567
5 GLOBEX 1548.00 556\n ESM3 1365428525735585
6 GLOBEX 1548.00 556\n ESM3 1365428525736116
7 GLOBEX 1548.00 556\n ESM3 1365428525740757
8 GLOBEX 1548.00 556\n ESM3 1365428525748502
9 GLOBEX 1548.00 557\n ESM3 1365428525748952
(The \n comes in with the data; it can be removed from quantity afterwards with rstrip().)
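As a minimal sketch, the comprehension above can be written so that the newline is stripped before splitting, which keeps the \n out of quantity in the first place. The names `sample` and `parse_line` are hypothetical, the in-memory `StringIO` stands in for data.txt, and the `split("=", 1)` is an added assumption so that a value containing `=` would not break the pairing:

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for data.txt (two of the rows shown above).
sample = io.StringIO(
    "symbol=ESM3,exchange=GLOBEX,timestamp=1365428525690751,price=1548.00,quantity=551\n"
    "symbol=ESM3,exchange=GLOBEX,timestamp=1365428525735567,price=1548.00,quantity=555\n"
)

def parse_line(line):
    """Turn 'k1=v1,k2=v2' into a dict, dropping the trailing newline first."""
    # split("=", 1) splits only on the first '=' in each field.
    pairs = (field.split("=", 1) for field in line.rstrip("\n").split(","))
    return dict(pairs)

# Skip blank lines so an empty trailing line does not raise.
df = pd.DataFrame([parse_line(line) for line in sample if line.strip()])
print(df)
```

All values are still strings at this point; commas inside values would still split incorrectly.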
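If you know the key names in advance, and the names always appear in the same order, you can use converters to chop off the key names and then use the names parameter to name the columns: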
import pandas as pd

def value(item):
    # Keep only the text after the first '=' (the value part of key=value)
    return item[item.find('=')+1:]

df = pd.read_table('data.txt', header=None, delimiter=',',
                   converters={i:value for i in range(5)},
                   names='symbol exchange timestamp price quantity'.split())
print(df)
With the data you posted, this yields
symbol exchange timestamp price quantity
0 ESM3 GLOBEX 1365428525690751 1548.00 551
1 ESM3 GLOBEX 1365428525697183 1548.00 551
2 ESM3 GLOBEX 1365428525714498 1548.00 551
3 ESM3 GLOBEX 1365428525734967 1548.00 551
4 ESM3 GLOBEX 1365428525735567 1548.00 555
5 ESM3 GLOBEX 1365428525735585 1548.00 556
6 ESM3 GLOBEX 1365428525736116 1548.00 556
7 ESM3 GLOBEX 1365428525740757 1548.00 556
8 ESM3 GLOBEX 1365428525748502 1548.00 556
9 ESM3 GLOBEX 1365428525748952 1548.00 557
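The price column above prints as 1548.00, which suggests the converter output is left as strings rather than parsed as numbers. A minimal sketch of casting afterwards follows; the target types are assumptions about what is wanted, the `sample` buffer is a hypothetical stand-in for data.txt, and `pd.read_csv` is used here in place of `read_table` with an explicit separator:

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for data.txt (two of the rows shown above).
sample = io.StringIO(
    "symbol=ESM3,exchange=GLOBEX,timestamp=1365428525690751,price=1548.00,quantity=551\n"
    "symbol=ESM3,exchange=GLOBEX,timestamp=1365428525735567,price=1548.00,quantity=555\n"
)

def value(item):
    # Keep only the text after the first '='.
    return item[item.find("=") + 1:]

names = "symbol exchange timestamp price quantity".split()
df = pd.read_csv(sample, header=None, sep=",",
                 converters={i: value for i in range(5)}, names=names)

# Assumed target types; adjust to the real data.
df = df.astype({"timestamp": "int64", "price": "float64", "quantity": "int64"})
print(df.dtypes)
```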
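I'm not sure what the best way to do this is, but assuming the delimiters never appear inside the values (thinking about the edge cases hurts my brain), something like this is not super elegant but is simple: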
>>> df = pd.read_csv("esm.csv", sep=",|=", header=None, engine="python")
>>> df2 = df.iloc[:, 1::2].copy()
>>> df2.columns = list(df.iloc[0, 0::2])
>>> df2
symbol exchange timestamp price quantity
0 ESM3 GLOBEX 1365428525690751 1548 551
1 ESM3 GLOBEX 1365428525697183 1548 551
2 ESM3 GLOBEX 1365428525714498 1548 551
3 ESM3 GLOBEX 1365428525734967 1548 551
4 ESM3 GLOBEX 1365428525735567 1548 555
5 ESM3 GLOBEX 1365428525735585 1548 556
6 ESM3 GLOBEX 1365428525736116 1548 556
7 ESM3 GLOBEX 1365428525740757 1548 556
8 ESM3 GLOBEX 1365428525748502 1548 556
9 ESM3 GLOBEX 1365428525748952 1548 557
Basically: read everything in first, then do the pivot yourself by keeping every other element, and finally fix up the column names.
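Since this approach takes the column names from the first row only, it may be worth checking that every row really carries the same keys in the same order. A rough sketch of that check, using positional indexing and a hypothetical in-memory `sample` in place of esm.csv (the names `raw`, `keys` and `values` are illustrative):

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for esm.csv (two of the rows shown above).
sample = io.StringIO(
    "symbol=ESM3,exchange=GLOBEX,timestamp=1365428525690751,price=1548.00,quantity=551\n"
    "symbol=ESM3,exchange=GLOBEX,timestamp=1365428525735567,price=1548.00,quantity=555\n"
)

# A regex separator is handled by the Python parsing engine.
raw = pd.read_csv(sample, sep=",|=", header=None, engine="python")

keys = raw.iloc[:, 0::2]           # even positions hold the key names
values = raw.iloc[:, 1::2].copy()  # odd positions hold the values

# Verify the assumption: each row has the same keys as the first row.
same_keys = bool((keys == keys.iloc[0]).all().all())
if not same_keys:
    raise ValueError("rows disagree on key names or key order")

values.columns = keys.iloc[0].tolist()
print(values)
```

If the check fails, the dict-per-line approach from the question is the safer fallback, since it does not depend on key order.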