Faster way to do these dataframe operations?

Question

I am loading a dataframe from csv, and then performing the operations below. Loading the dataframe takes about 2 seconds. The other operations (mainly the date conversions) take 30 seconds. Is there a way to speed up the other operations?

import numpy as np
import pandas as pd
from datetime import datetime

# fn (the CSV path) and dicInPlay (a lookup dict) are defined elsewhere.
# pd.read_csv replaces pd.DataFrame.from_csv, which newer pandas no longer provides.
df = pd.read_csv( fn )

# Parse the day-first date columns with explicit formats.
df['SCHEDULED_OFF'] = pd.to_datetime( df['SCHEDULED_OFF'], format='%d-%m-%Y %H:%M' )
df['LATEST_TAKEN'] = pd.to_datetime( df['LATEST_TAKEN'], format='%d-%m-%Y %H:%M:%S' )
df['FIRST_TAKEN'] = pd.to_datetime( df['FIRST_TAKEN'], format='%d-%m-%Y %H:%M:%S' )
df['SETTLED_DATE'] = pd.to_datetime( df['SETTLED_DATE'], format='%d-%m-%Y %H:%M:%S' )
df['ACTUAL_OFF'] = pd.to_datetime( df['ACTUAL_OFF'], format='%d-%m-%Y %H:%M:%S' )
# pd.datetime was an alias for datetime.datetime, so datetime.min is the same value.
df['ACTUAL_OFF'] = df['ACTUAL_OFF'].fillna( datetime.min )

# Offsets from the scheduled off time, in seconds.
df[ 'LATEST_TAKEN_FROM_SCHEDULED_OFF' ] = ( df['SCHEDULED_OFF'].values - df['LATEST_TAKEN'].values ) / np.timedelta64( 1, 's' )
df[ 'FIRST_TAKEN_FROM_SCHEDULED_OFF' ] = ( df['SCHEDULED_OFF'].values - df['FIRST_TAKEN'].values ) / np.timedelta64( 1, 's' )

# Translate the IN_PLAY codes and blank out missing strings.
df[ 'IN_PLAY' ] = [ dicInPlay[ x ] for x in df[ 'IN_PLAY' ] ]
df['COUNTRY'] = df['COUNTRY'].fillna( '' )
df['FULL_DESCRIPTION'] = df['FULL_DESCRIPTION'].fillna( '' )
df['EVENT'] = df['EVENT'].fillna( '' )
df['COURSE'] = df['COURSE'].fillna( '' )

Answer 1

Not really a solution, but one way to do this faster is to have the dates in standard ISO format.

To illustrate that this can make a big difference, here are some timings (with a column of 10000 date strings):

# with standard ISO formatted strings (%Y-%m-%d %H:%M:%S)
In [52]: %timeit pd.to_datetime(df['date'])
100 loops, best of 3: 2.88 ms per loop

# with your dayfirst-like format (%d-%m-%Y %H:%M)
In [66]: %timeit pd.to_datetime(df['date'], format='%d-%m-%Y %H:%M')
10 loops, best of 3: 78.2 ms per loop

In [67]: %timeit pd.to_datetime(df['date'], dayfirst=True)
1 loops, best of 3: 800 ms per loop

So I think part of the reason it is slow is this date parsing (a 20-30x slowdown when the strings are not in standard ISO format). I don't know whether it can be sped up further if you can't change the format.
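If you want to check this on your own machine, something like the following sketch should reproduce the comparison. The sample data (10,000 timestamps one minute apart) and the names iso_strings, dayfirst_strings, parse_iso and parse_dayfirst are made up for illustration. The absolute numbers, and probably the ratio too, will depend on your pandas version and hardware.

```python
import timeit

import pandas as pd

# Hypothetical sample data: 10,000 timestamps, one minute apart.
stamps = pd.date_range("2014-01-01", periods=10_000, freq="min")

# The same timestamps rendered as ISO strings and as day-first strings.
iso_strings = pd.Series(stamps.strftime("%Y-%m-%d %H:%M:%S"))
dayfirst_strings = pd.Series(stamps.strftime("%d-%m-%Y %H:%M"))


def parse_iso():
    # No format given: pandas recognises the ISO layout on its own.
    return pd.to_datetime(iso_strings)


def parse_dayfirst():
    # Explicit format, as in the question.
    return pd.to_datetime(dayfirst_strings, format="%d-%m-%Y %H:%M")


# Both routes should give the same timestamps before we compare speed.
assert (parse_iso() == parse_dayfirst()).all()

runs = 10
iso_time = timeit.timeit(parse_iso, number=runs) / runs
dayfirst_time = timeit.timeit(parse_dayfirst, number=runs) / runs
print(f"ISO strings:       {iso_time * 1000:.2f} ms per run")
print(f"day-first strings: {dayfirst_time * 1000:.2f} ms per run")
```

If the gap is large for you as well, writing the CSV with ISO dates in the first place (when you control how it is produced) is probably the simplest fix.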

For the other lines I don't directly see a possible speed-up. Only for [ dicInPlay[ x ] for x in df[ 'IN_PLAY' ] ] could you test whether df['IN_PLAY'].map(dicInPlay) is faster.
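A small sketch of that replacement, to show the two forms give the same values. The contents of dicInPlay and the sample column are hypothetical, since the question does not show them. One difference to be aware of: for a value that is not a key in the dict, the list comprehension raises a KeyError, while map puts NaN there.

```python
import pandas as pd

# Hypothetical lookup dict and data; the real dicInPlay is not shown in the question.
dicInPlay = {"IP": True, "PE": False}
df = pd.DataFrame({"IN_PLAY": ["IP", "PE", "PE", "IP"]})

# Original approach: a Python-level loop over the column.
via_loop = [dicInPlay[x] for x in df["IN_PLAY"]]

# Suggested alternative: let pandas do the lookup.
via_map = df["IN_PLAY"].map(dicInPlay)

# Same values either way, as long as every code is a key in the dict.
assert via_map.tolist() == via_loop

df["IN_PLAY"] = via_map
```

Whether map is actually faster on your data is something to time, for example with %timeit as above.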

Answer 1: 2 2014-09-01 12:18:13
