Most efficient way to convert date strings into a pandas time series index

Question

My CSV data file contains dates in the following format:

In: data["Day"].unique()

Out: array(['04/23/17', '04/20/17', '04/21/17', '04/24/17', '04/22/17',
       '05/02/17', '04/27/17', '05/06/17', '04/30/17', '04/25/17',
       '04/26/17', '05/04/17'], dtype=object)

I want to turn it into a proper pandas time series. I've tried this:

data["DayIndex"] = pandas.DatetimeIndex(data["Day"])

It takes ages even for a few hundred thousand rows. What are my options to speed up the parsing?

Answer 1

data['DayIndex'] = pandas.to_datetime(data['Day'])

Incorporating @ayhan's comment, pass an explicit format (note the lowercase %y for a two-digit year):

data['DayIndex'] = pandas.to_datetime(data['Day'], format='%m/%d/%y')
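
Since the question asks for a time series index, here is a minimal sketch of going from the string column to a DatetimeIndex. The three-row frame and the name ts are made up for illustration; only the Day and DayIndex column names come from the question.

```python
import pandas as pd

# Hypothetical sample frame standing in for the CSV data.
data = pd.DataFrame({"Day": ["04/23/17", "04/20/17", "05/02/17"]})

# %y matches a two-digit year; %Y would expect four digits.
data["DayIndex"] = pd.to_datetime(data["Day"], format="%m/%d/%y")

# Use the parsed column as the index to get a DatetimeIndex, sorted by date.
ts = data.set_index("DayIndex").sort_index()
```

Sorting is optional, but time-based slicing generally behaves better on a sorted index.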

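Or, when you import from CSV, include parse_dates:

from datetime import datetime  # pandas.datetime was removed in later pandas releases
# note: date_parser is deprecated as of pandas 2.0 in favor of date_format
data = pandas.read_csv(..., parse_dates=['Day'],
     date_parser=lambda s: datetime.strptime(s, '%m/%d/%y'))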

I'm not sure whether reusing already-parsed dates has been incorporated into the most recent version of pandas. I don't think so; at least I didn't see it in the "What's New" section.

Anyway, we can build a custom parser that reuses earlier results instead of reparsing dates we have already seen.

Let's use map and a dictionary (hash) lookup.

import pandas as pd

# let u be the unique date strings, so that we only parse each one once
u = pd.unique(data['Day'])

# then build a dictionary mapping each string to its parsed timestamp
m = dict(zip(u, pd.to_datetime(u, format='%m/%d/%y')))

# then use `map` to build the new column
data['DayIndex'] = data['Day'].map(m)
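
The same idea can be wrapped in a small reusable helper. This is only a sketch: the function name parse_unique_then_map and the sample series are made up for illustration, and it assumes every string matches the given format.

```python
import pandas as pd


def parse_unique_then_map(strings: pd.Series, fmt: str) -> pd.Series:
    """Parse each distinct date string once, then map the results back."""
    unique_strings = pd.unique(strings)
    parsed = pd.to_datetime(unique_strings, format=fmt)
    # Dictionary lookup from raw string to parsed timestamp.
    lookup = dict(zip(unique_strings, parsed))
    return strings.map(lookup)


# Hypothetical sample input with a repeated date.
days = pd.Series(["04/23/17", "04/20/17", "04/23/17", "05/02/17"])
result = parse_unique_then_map(days, "%m/%d/%y")
```

The benefit grows with the ratio of total rows to distinct dates; if nearly every string is unique, the extra dictionary pass is unlikely to help.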

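Timing

import numpy as np
import pandas as pd

# 100,000 rows drawn at random from the 12 unique date strings in the question
a = np.random.choice(
    ['04/23/17', '04/20/17', '04/21/17', '04/24/17', '04/22/17',
     '05/02/17', '04/27/17', '05/06/17', '04/30/17', '04/25/17',
     '04/26/17', '05/04/17'],
    100000)

data = pd.DataFrame(dict(Day=a))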


%%timeit
u = pd.unique(a)
m = dict(zip(u, pd.to_datetime(u, format='%m/%d/%y'))) 
data['Day'].map(m)
100 loops, best of 3: 15.4 ms per loop

%timeit pd.to_datetime(data['Day'], format='%m/%d/%y')
1 loop, best of 3: 206 ms per loop
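
The %timeit lines above need IPython. A rough equivalent with the standard timeit module might look like the sketch below. The row count, the function names direct and via_map, and the use of cache=False are illustrative choices rather than part of the original measurement. The cache argument of to_datetime exists only in later pandas releases (0.23 and up, as far as I know), where pandas can reuse parsed values itself, so it is switched off here to keep the baseline comparable. Timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

FMT = "%m/%d/%y"
choices = ["04/23/17", "04/20/17", "04/21/17", "04/24/17", "04/22/17", "05/02/17"]
rng = np.random.default_rng(0)
data = pd.DataFrame({"Day": rng.choice(choices, 20_000)})


def direct():
    # Parse every row, with pandas' own cache of repeated strings disabled.
    return pd.to_datetime(data["Day"], format=FMT, cache=False)


def via_map():
    # Parse each unique string once and map the results back.
    u = pd.unique(data["Day"])
    m = dict(zip(u, pd.to_datetime(u, format=FMT)))
    return data["Day"].map(m)


t_direct = timeit.timeit(direct, number=5) / 5
t_map = timeit.timeit(via_map, number=5) / 5
print(f"direct: {t_direct * 1e3:.2f} ms, unique+map: {t_map * 1e3:.2f} ms")

# Both approaches should yield the same timestamps.
same = bool((direct() == via_map()).all())
```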

Answer 1
3, accepted, 2017-05-18 17:53:24
