
Reading key-value pairs into Pandas

Pandas makes it really easy to read a CSV file:

pd.read_table('data.txt', sep=',')

Does Pandas have something similar for a file with key-value pairs? I came up with this:

pd.DataFrame([dict([p.split('=') for p in l.split(',')]) for l in open('data.txt')])

If nothing is built in, is there perhaps something more idiomatic?

The file of interest looks like this:

symbol=ESM3,exchange=GLOBEX,timestamp=1365428525690751,price=1548.00,quantity=551
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525697183,price=1548.00,quantity=551
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525714498,price=1548.00,quantity=551
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525734967,price=1548.00,quantity=551
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525735567,price=1548.00,quantity=555
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525735585,price=1548.00,quantity=556
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525736116,price=1548.00,quantity=556
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525740757,price=1548.00,quantity=556
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525748502,price=1548.00,quantity=556
symbol=ESM3,exchange=GLOBEX,timestamp=1365428525748952,price=1548.00,quantity=557

It has the exact same keys on every line, and in the same order. There are no null values. The table to be generated is:

  exchange    price quantity symbol         timestamp
0   GLOBEX  1548.00    551\n   ESM3  1365428525690751
1   GLOBEX  1548.00    551\n   ESM3  1365428525697183
2   GLOBEX  1548.00    551\n   ESM3  1365428525714498
3   GLOBEX  1548.00    551\n   ESM3  1365428525734967
4   GLOBEX  1548.00    555\n   ESM3  1365428525735567
5   GLOBEX  1548.00    556\n   ESM3  1365428525735585
6   GLOBEX  1548.00    556\n   ESM3  1365428525736116
7   GLOBEX  1548.00    556\n   ESM3  1365428525740757
8   GLOBEX  1548.00    556\n   ESM3  1365428525748502
9   GLOBEX  1548.00    557\n   ESM3  1365428525748952

(I can remove the \n from quantity with an rstrip() after I've brought it in.)
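A small variation on the one-liner above avoids the stray \n altogether by stripping each line before splitting it. This is only a sketch: parse_line is a hypothetical helper name, and a StringIO holding two of the posted lines stands in for data.txt. All values come back as strings.

```python
from io import StringIO

import pandas as pd


def parse_line(line):
    # Drop the trailing newline first, then split each "key=value" pair
    # on the first "=" only, so a value containing "=" is kept whole.
    return dict(pair.split("=", 1) for pair in line.strip().split(","))


# Stand-in for open('data.txt'); two lines copied from the posted file.
sample = StringIO(
    "symbol=ESM3,exchange=GLOBEX,timestamp=1365428525690751,price=1548.00,quantity=551\n"
    "symbol=ESM3,exchange=GLOBEX,timestamp=1365428525735567,price=1548.00,quantity=555\n"
)

# Skip blank lines so a trailing empty line does not break the split.
df = pd.DataFrame([parse_line(line) for line in sample if line.strip()])
print(df)
```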

If you know the key names beforehand, and the names always appear in the same order, then you could use a converter to chop off the key names and use the names parameter to name the columns:

import pandas as pd

def value(item):
    return item[item.find('=')+1:]

df = pd.read_table('data.txt', header=None, delimiter=',',
                   converters={i:value for i in range(5)},
                   names='symbol exchange timestamp price quantity'.split())
print(df)

Run on your posted data, this yields:

  symbol exchange         timestamp    price quantity
0   ESM3   GLOBEX  1365428525690751  1548.00      551
1   ESM3   GLOBEX  1365428525697183  1548.00      551
2   ESM3   GLOBEX  1365428525714498  1548.00      551
3   ESM3   GLOBEX  1365428525734967  1548.00      551
4   ESM3   GLOBEX  1365428525735567  1548.00      555
5   ESM3   GLOBEX  1365428525735585  1548.00      556
6   ESM3   GLOBEX  1365428525736116  1548.00      556
7   ESM3   GLOBEX  1365428525740757  1548.00      556
8   ESM3   GLOBEX  1365428525748502  1548.00      556
9   ESM3   GLOBEX  1365428525748952  1548.00      557
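One thing to keep in mind with this approach is that a converter returns whatever the function returns, so every column here holds strings. If numeric columns are wanted, a cast afterwards is one option. The sketch below assumes timestamp and quantity are integers and price is a float, and uses a StringIO with two of the posted lines in place of data.txt.

```python
from io import StringIO

import pandas as pd


def value(item):
    # Keep only the text after the first "=".
    return item[item.find("=") + 1:]


names = "symbol exchange timestamp price quantity".split()

# Stand-in for data.txt; two lines copied from the posted file.
sample = StringIO(
    "symbol=ESM3,exchange=GLOBEX,timestamp=1365428525690751,price=1548.00,quantity=551\n"
    "symbol=ESM3,exchange=GLOBEX,timestamp=1365428525735567,price=1548.00,quantity=555\n"
)

df = pd.read_csv(sample, header=None, names=names,
                 converters={i: value for i in range(len(names))})

# The column types below are an assumption about this particular data.
df = df.astype({"timestamp": "int64", "price": "float64", "quantity": "int64"})
print(df.dtypes)
```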

I'm not sure what the best way to do this is, but assuming the delimiters never appear in the values (it hurts my brain to think of the corner cases), something like this isn't super elegant but is straightforward:

>>> df = pd.read_csv("esm.csv", sep=",|=", header=None, engine="python")
>>> df2 = df.iloc[:, 1::2]
>>> df2.columns = list(df.iloc[0, 0::2])
>>> df2
  symbol exchange         timestamp  price  quantity
0   ESM3   GLOBEX  1365428525690751   1548       551
1   ESM3   GLOBEX  1365428525697183   1548       551
2   ESM3   GLOBEX  1365428525714498   1548       551
3   ESM3   GLOBEX  1365428525734967   1548       551
4   ESM3   GLOBEX  1365428525735567   1548       555
5   ESM3   GLOBEX  1365428525735585   1548       556
6   ESM3   GLOBEX  1365428525736116   1548       556
7   ESM3   GLOBEX  1365428525740757   1548       556
8   ESM3   GLOBEX  1365428525748502   1548       556
9   ESM3   GLOBEX  1365428525748952   1548       557

Basically, read it in, and then do the pivot yourself: keep every other column (the values) and then fix the column names using the keys from the first row.
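Because this approach takes the column names from the first row only, it quietly depends on every line carrying the same keys in the same order. A cheap check can make that assumption explicit. The following is a sketch, not part of the original answer; the variable names are illustrative and a StringIO with two of the posted lines stands in for the file.

```python
from io import StringIO

import pandas as pd

# Stand-in for the input file; two lines copied from the posted data.
sample = StringIO(
    "symbol=ESM3,exchange=GLOBEX,timestamp=1365428525690751,price=1548.00,quantity=551\n"
    "symbol=ESM3,exchange=GLOBEX,timestamp=1365428525735567,price=1548.00,quantity=555\n"
)

# A regex separator needs the Python parsing engine.
raw = pd.read_csv(sample, sep=r"[,=]", header=None, engine="python")

keys = raw.iloc[:, 0::2]           # even-numbered columns hold the key names
values = raw.iloc[:, 1::2].copy()  # odd-numbered columns hold the values

# Every row must repeat the keys of the first row, in the same order.
if not (keys == keys.iloc[0]).all().all():
    raise ValueError("key names differ between lines")

values.columns = keys.iloc[0].tolist()
print(values)
```

If the check fails, the dict-per-line approach from the question is the safer choice, since it does not rely on key position.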
