pandas.read_csv: how do I parse two columns as datetimes in a hierarchically-indexed CSV?
I have a CSV file that, simplified, looks like this:
X,,Y,,Z,
Date,Time,A,B,A,B
2017-01-21,01:57:49.390,0,1,2,3
2017-01-21,01:57:50.400,4,5,7,9
2017-01-21,01:57:51.410,3,2,4,1
The first two columns are date and time. When I do
pandas.read_csv('foo.csv', header=[0,1])
I get the following DataFrame:
X Unnamed: 1_level_0 Y Unnamed: 3_level_0 Z Unnamed: 5_level_0
Date Time A B A B
0 2017-01-21 01:57:49.390 0 1 2 3
1 2017-01-21 01:57:50.400 4 5 7 9
2 2017-01-21 01:57:51.410 3 2 4 1
Ignoring the annoying unnamed entries in the columns for now, I'd like to combine the first two columns into a single datetime. So I tried using the parse_dates argument:
pandas.read_csv('foo.csv', header=[0,1], parse_dates={'datetime': [0,1]})
But all I get from this is a traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 646, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 401, in _read
data = parser.read()
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 939, in read
ret = self._engine.read(nrows)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1585, in read
names, data = self._do_date_conversions(names, data)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1364, in _do_date_conversions
self.index_names, names, keep_date_col=self.keep_date_col)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 2737, in _process_date_conversion
data_dict.pop(c)
KeyError: "('X', 'Date')"
I'm not sure why it's hitting a KeyError on ('X', 'Date'), since those are definitely present in the columns. I don't really know if this is a bug in pandas that I should report (I'm using 0.19.2), or if I'm just not understanding something. Any ideas?
You can work around this if needed:
import datetime as dt
import pandas as pd
# read in the csv file
df = pd.read_csv('foo.csv', header=[0, 1])
# get a label for the funky column names
date_label, time_label = tuple(df.columns.values)[0:2]
# merge the columns into a single datetime
dates = [
    dt.datetime.strptime('T'.join(ts) + '000', '%Y-%m-%dT%H:%M:%S.%f')
    for ts in zip(df[date_label], df[time_label])
]
# save the new column
df['DateTime'] = pd.Series(dates).values
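The same merge can also be done without a Python-level loop. The following is only a minimal sketch, assuming both columns are read in as strings. The names csv_text, combined and datetimes are illustrative, and the inline CSV text stands in for foo.csv.

```python
import io

import pandas as pd

# Sample data copied from the question; stands in for foo.csv
csv_text = """X,,Y,,Z,
Date,Time,A,B,A,B
2017-01-21,01:57:49.390,0,1,2,3
2017-01-21,01:57:50.400,4,5,7,9
2017-01-21,01:57:51.410,3,2,4,1
"""

df = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

# With a two-row header, each column label is a tuple such as ('X', 'Date')
date_label, time_label = df.columns[:2]

# Join the date and time strings element-wise, then parse the whole column at once
combined = df[date_label].astype(str) + ' ' + df[time_label].astype(str)
datetimes = pd.to_datetime(combined, format='%Y-%m-%d %H:%M:%S.%f')
```

The resulting datetimes Series can be stored in the frame the same way as the list-based version above.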
Update:
I have submitted a bug and a pull request for this issue. In response to the bug, jreback (pandas lead maintainer) gave a fairly detailed response about issues with the multi-level header from the example. I think you are already aware of these issues, but you may want to read what he wrote. At the end of the response he had this bit that may provide a workaround:
Making a single level is just not useful in a multi-level frame. I would probably do this:
In [25]: pandas.read_csv(StringIO(data), header=0, skiprows=1, parse_dates={'datetime':[0,1]})
Out[25]:
datetime A B A.1 B.1
0 2017-01-21 01:57:49.390 0 1 2 3
1 2017-01-21 01:57:50.400 4 5 7 9
2 2017-01-21 01:57:51.410 3 2 4 1
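If the dictionary form of parse_dates is not available in your pandas version, a similar flat result can be sketched by reading only the second header line and combining the columns afterwards. This is not what jreback posted, just a sketch under stated assumptions: header=1 is used in place of header=0 with skiprows=1, and flat, csv_text and stamps are illustrative names.

```python
import io

import pandas as pd

# Sample data copied from the question; stands in for foo.csv
csv_text = """X,,Y,,Z,
Date,Time,A,B,A,B
2017-01-21,01:57:49.390,0,1,2,3
2017-01-21,01:57:50.400,4,5,7,9
2017-01-21,01:57:51.410,3,2,4,1
"""

# header=1 takes the second line as the column names and ignores the line
# above it; the repeated names A and B are renamed to A.1 and B.1
flat = pd.read_csv(io.StringIO(csv_text), header=1)

# Remove the two string columns and parse their concatenation
date_part = flat.pop('Date').astype(str)
time_part = flat.pop('Time').astype(str)
stamps = pd.to_datetime(date_part + ' ' + time_part)

# Put the combined column first, as in the output shown above
flat.insert(0, 'datetime', stamps)
```

This drops the X/Y/Z level entirely, which is the trade-off the quoted suggestion makes as well.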