
Faster conversion of a Pandas datetime column to string

I have a major performance issue converting Pandas dataframe date columns to string columns. Currently I am working with the code below. The column date_column (generally) contains datetime objects, which I would like to transform into more human-readable strings using strftime('%d.%m.%Y'). Because it happened with previous data I used in my model, I check whether a value is a string or NA, i.e. not a datetime object. In that case I would like the output to be 'no date'.

Unfortunately my approach is extremely slow. Especially with large data (10M rows and more) it can take 1-2 minutes or even longer. Any recommendations for improving performance are much appreciated. So far I could not find any solutions on Stack Overflow.

Thank you!

dataframe_input.loc[:,date_column] = (dataframe_input.loc[:,date_column].map(lambda x: x.strftime('%d.%m.%Y') if pd.notnull(x) and not isinstance(x, str) else "no date")).apply(str)

If it is correct to assume that a large number of records will have the same date (which seems likely for a dataset with 10M records), we can leverage that and improve efficiency by not converting the same date to string over and over.

For example, here's what it looks like on per-second data from 2021-01-01 to 2021-02-01 (about 2.7M records):

import pandas as pd
df = pd.DataFrame({'dt': pd.date_range('2021-01-01', '2021-02-01', freq='1s')})

Here it is with strftime applied to the whole column:

%%time
df['dt_str'] = df['dt'].dt.strftime('%d.%m.%Y')

Output:

CPU times: user 8.07 s, sys: 63.9 ms, total: 8.14 s
Wall time: 8.14 s

And here's with map applied to de-duplicated values:

%%time
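# Truncate each timestamp to midnight so all rows of one day share a key
days = df['dt'].dt.normalize()
dts = days.drop_duplicates()
# Lookup table: unique day -> formatted string (.to_numpy() avoids index alignment)
m = pd.Series(dts.dt.strftime('%d.%m.%Y').to_numpy(), index=dts.to_numpy())
df['dt_str'] = days.map(m)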

Output:

CPU times: user 207 ms, sys: 32 ms, total: 239 ms
Wall time: 240 ms

It's about 30x faster. Of course, the speedup depends on the number of unique date values -- the higher the number, the less we gain by using this method.
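To tie this back to the question, where the column may also hold strings or missing values that should become 'no date', here is a minimal sketch of the same de-duplication idea. The function name dates_to_str and its parameters are made up for illustration, and it assumes that anything that is a string or cannot be interpreted as a datetime should count as "no date":

```python
import pandas as pd

def dates_to_str(col, fmt="%d.%m.%Y", missing="no date"):
    """Format a mixed column, calling strftime once per unique day (hypothetical helper)."""
    # Strings are treated as "not a date", as in the question's lambda.
    is_str = col.map(lambda x: isinstance(x, str)).astype(bool)
    # Blank out strings, then coerce everything else to datetime64 (NA -> NaT).
    dt = pd.to_datetime(col.where(~is_str), errors="coerce")
    # Truncate to midnight so every timestamp of one day shares a key.
    days = dt.dt.normalize()
    unique_days = pd.DatetimeIndex(days.dropna().unique())
    # Lookup table: unique day -> formatted string.
    lookup = pd.Series(unique_days.strftime(fmt), index=unique_days)
    # NaT is not in the lookup, so it maps to NaN and is filled afterwards.
    return days.map(lookup).fillna(missing)
```

It could be used as dataframe_input[date_column] = dates_to_str(dataframe_input[date_column]). The isinstance pass still touches every row, so the gain comes from skipping the per-row strftime calls, and it shrinks as the share of unique days grows.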

