Faster conversion of Pandas datetime column to string
I have a major performance issue converting a Pandas dataframe date column to a string column. Currently I am working with the code below. The column date_column (generally) contains datetime objects, which I would like to transform into more human-readable strings using strftime('%d.%m.%Y'). Because it happened with previous data I used in my model, I also check whether a value is a string or NA, i.e. not a datetime object. In that case I would like the output to be 'no date'.
Unfortunately my approach is extremely slow. Especially with large data (10M rows and more) it can take 1-2 minutes or even longer. Any recommendations with regard to performance improvement are much appreciated. So far I could not find any solutions on Stack Overflow.
Thank you!
dataframe_input.loc[:, date_column] = (
    dataframe_input.loc[:, date_column]
    .map(lambda x: x.strftime('%d.%m.%Y')
         if pd.notnull(x) and not isinstance(x, str)
         else "no date")
    .apply(str)
)
If it is correct to assume that a large number of records will share the same date (which seems likely for a dataset with 10M records), we can leverage that and improve efficiency by not converting the same date to a string over and over.
For example, here's how it looks on per-second data from 2021-01-01 to 2021-02-01 (about 2.7M records):
df = pd.DataFrame({'dt': pd.date_range('2021-01-01', '2021-02-01', freq='1s')})
Here's with strftime applied to the whole column:
%%time
df['dt_str'] = df['dt'].dt.strftime('%d.%m.%Y')
Output:
CPU times: user 8.07 s, sys: 63.9 ms, total: 8.14 s
Wall time: 8.14 s
And here's with map applied to de-duplicated values:
%%time
# truncate to day and keep one row per unique date
dts = df['dt'].dt.normalize().drop_duplicates()
# one formatted string per unique date, indexed by that date
m = pd.Series(dts.dt.strftime('%d.%m.%Y').to_numpy(), index=dts)
df['dt_str'] = df['dt'].dt.normalize().map(m)
Output:
CPU times: user 207 ms, sys: 32 ms, total: 239 ms
Wall time: 240 ms
It's about 30x faster. Of course, the speedup depends on the number of unique date values: the higher that number, the less we gain by using this method.
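The question also asks for 'no date' whenever a value is a string or missing; the same de-duplication trick can cover that by coercing non-datetime values to NaT with pd.to_datetime(..., errors='coerce') and filling unmatched rows at the end. A minimal sketch (the sample column and the date_str name are made up for illustration):

```python
import pandas as pd

# Sample column mixing datetimes, a stray string, and a missing value.
df = pd.DataFrame({'date_column': [
    pd.Timestamp('2021-01-01 00:00:05'),
    pd.Timestamp('2021-01-01 13:30:00'),
    'not a date',
    None,
]})

# Coerce anything that isn't a datetime to NaT, then truncate to day.
dt = pd.to_datetime(df['date_column'], errors='coerce').dt.normalize()

# Format each unique date exactly once and build a lookup series.
dts = dt.drop_duplicates().dropna()
m = pd.Series(dts.dt.strftime('%d.%m.%Y').to_numpy(), index=dts)

# NaT rows find no match in the lookup and become 'no date'.
df['date_str'] = dt.map(m).fillna('no date')
```

Since every non-datetime value is coerced to NaT, it simply falls out of the lookup and is filled with 'no date' afterwards, so no per-row isinstance check is needed.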