
Vectorizing a Pandas apply function for tz_convert

I have a dataframe where the hour column contains datetime data in UTC. I have a time_zone column with time zones for each observation, and I'm using it to convert hour to the local time and save it in a new column named local_hour. To do this, I'm using the following code:

import pandas as pd

# Sample dataframe
df = pd.DataFrame({
    'hour': ['2019-01-01 05:00:00', '2019-01-01 07:00:00', '2019-01-01 08:00:00'],
    'time_zone': ['US/Eastern', 'US/Central', 'US/Mountain']
})

# Ensure hour is in datetime format and localized to UTC
df['hour'] = pd.to_datetime(df['hour']).dt.tz_localize('UTC')

# Add local_hour column with hour in local time 
df['local_hour'] = df.apply(lambda row: row['hour'].tz_convert(row['time_zone']), axis=1)

df
                       hour    time_zone                 local_hour
0 2019-01-01 05:00:00+00:00   US/Eastern  2019-01-01 00:00:00-05:00
1 2019-01-01 07:00:00+00:00   US/Central  2019-01-01 01:00:00-06:00
2 2019-01-01 08:00:00+00:00  US/Mountain  2019-01-01 01:00:00-07:00

The code works. However, using apply runs quite slowly since in reality I have a large dataframe. Is there a way to vectorize this or otherwise speed it up?

Note: I have tried using the swifter package, but in my case it doesn't speed things up.
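For reference, a minimal sketch of how swifter slots in here (the .swifter accessor is swifter's documented entry point; this exact snippet is my illustration, not from the original post):

import pandas as pd
import swifter  # importing registers the .swifter accessor on DataFrames

# Same row-wise conversion as above, dispatched through swifter, which
# chooses between a parallel backend and plain apply based on the data.
df['local_hour'] = df.swifter.apply(
    lambda row: row['hour'].tz_convert(row['time_zone']), axis=1
)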

Assuming there is not an unlimited number of distinct time_zone values, you could perform a tz_convert per group, like:

# Convert each group's hours in one vectorized call; x.name is the group's time zone
df['local_hour'] = df.groupby('time_zone')['hour'].apply(lambda x: x.dt.tz_convert(x.name))
print(df)

                       hour    time_zone                 local_hour
0 2019-01-01 05:00:00+00:00   US/Eastern  2019-01-01 00:00:00-05:00
1 2019-01-01 07:00:00+00:00   US/Central  2019-01-01 01:00:00-06:00
2 2019-01-01 08:00:00+00:00  US/Mountain  2019-01-01 01:00:00-07:00

On this small sample it will probably be slower than what you did, but on bigger data with a limited number of groups it should be faster.

For a speed comparison, with the 3-row df you provided, it gives:

%timeit df.apply(lambda row: row['hour'].tz_convert(row['time_zone']), axis=1)
# 1.6 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.groupby('time_zone')['hour'].apply(lambda x: x.dt.tz_convert(x.name))
# 2.58 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So apply is faster here, but if you create a dataframe 1000 times bigger, still with only 3 time zones, then groupby comes out about 20 times faster:

df = pd.concat([df]*1000, ignore_index=True)

%timeit df.apply(lambda row: row['hour'].tz_convert(row['time_zone']), axis=1)
# 585 ms ± 42.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df.groupby('time_zone')['hour'].apply(lambda x: x.dt.tz_convert(x.name))
# 27.5 ms ± 2.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
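
If you want to avoid the groupby.apply machinery, the same per-group idea can be written as an explicit loop over the distinct time zones. This sketch is my illustration (not part of the original answer, and not benchmarked); it makes the scaling argument explicit: tz_convert runs once per time zone rather than once per row.

import pandas as pd

# Convert each time-zone group in a single vectorized call, then reassemble.
# pd.concat coerces the differently-zoned pieces to one object column, and
# the column assignment realigns the result to df's row order by index.
pieces = [grp.dt.tz_convert(tz) for tz, grp in df.groupby('time_zone')['hour']]
df['local_hour'] = pd.concat(pieces)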
