Vectorization of a date range function for transforming a pandas DataFrame
This is a problem of transforming a date range into a numerical value based on the current date.
Input table:
ID START_DATE END_DATE CURRENT_DATE
1 2010-12-08 2011-03-01 2011-04-01
2 2010-12-10 2011-01-12 2011-01-02
3 2010-12-16 2011-03-07 2010-10-10
Output table:
ID START_DATE END_DATE CURRENT_DATE number_of_days
1 2010-12-08 2011-03-01 2011-04-01 78.148490
2 2010-12-10 2011-01-12 2011-01-02 23.726149
3 2010-12-16 2011-03-07 2010-10-10 0.000000
where number_of_days is computed with an exponential decay function: one decay term per day in the date range, summed over each row.
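Written out, the loop in the question's function corresponds to the sum below. The symbols $t_s$, $t_c$ and $t_e$ are introduced here only for illustration; they stand for START_DATE, CURRENT_DATE and END_DATE expressed as day counts, and the decay constant 0.001 is the one used in the question's code. If the upper limit is negative, the sum is empty and the value is 0.

```latex
\text{number\_of\_days} =
\begin{cases}
\displaystyle\sum_{i=0}^{\min(t_e,\,t_c) - t_s} \exp\bigl(-0.001\,(t_c - t_s - i)\bigr) & \text{if } t_c > t_s,\\[1ex]
0 & \text{otherwise.}
\end{cases}
```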
We can implement a function as follows:
import math
from datetime import timedelta as td

def transform(start, end, current):
    value = 0
    if current > end:  # current date is later than the end date
        delta = end - start
        for i in range(delta.days + 1):
            diff = current - (start + td(days=i))
            value += math.exp(-0.001 * diff.days)
    elif current > start:  # current date is between the start and end
        delta = current - start
        for i in range(delta.days + 1):
            diff = current - (start + td(days=i))
            value += math.exp(-0.001 * diff.days)
    else:  # current date is on or before the start date: nothing to sum
        pass
    return value
and then apply the transformation below:
df['number_of_days'] = df.apply(lambda x: transform(x['START_DATE'], x['END_DATE'], x['CURRENT_DATE']), axis=1)
However, this is very slow for a table with millions of rows and large date ranges.
Any idea on how to speed up the process by vectorizing the inner for loop in the transformation function?
Thank you!
You could vectorize using numpy array functions to calculate the exponential decay.
df = df[df.CURRENT_DATE > df.START_DATE] # just focusing on cases with calculation
Get the relevant delta depending on CURRENT_DATE and END_DATE:
delta = df[['END_DATE', 'CURRENT_DATE']].min(axis=1).subtract(df.START_DATE).dt.days.add(1)
Calculate the shift of the arange() for the exponential decay as the larger of 0 and the difference between CURRENT_DATE and END_DATE:
shift = df.CURRENT_DATE.subtract(df.END_DATE).dt.days.clip(lower=0)
Produce and process the (adjusted) arange objects using np.exp() and np.sum():
df['number_of_days'] = [np.sum(np.exp(-0.001 * (np.arange(d) + s))) for d, s in zip(delta.values, shift.values)]
to get:
START_DATE END_DATE CURRENT_DATE number_of_days
ID
1 2010-12-08 2011-03-01 2011-04-01 78.148490
2 2010-12-10 2011-01-12 2011-01-02 23.726149
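To make delta and shift concrete, here is a minimal sketch that works through the row with ID 1 by hand. The scalar names (start, end, current, offsets, value) are chosen here for illustration; the dates and the 0.001 decay constant come from the question.

```python
import numpy as np
import pandas as pd

# Row ID 1 from the input table
start = pd.Timestamp("2010-12-08")
end = pd.Timestamp("2011-03-01")
current = pd.Timestamp("2011-04-01")

# Number of daily terms: from START_DATE up to END_DATE or CURRENT_DATE,
# whichever comes first, counting both endpoints.
delta = (min(end, current) - start).days + 1

# Days elapsed between END_DATE and CURRENT_DATE (0 if CURRENT_DATE is earlier).
shift = max((current - end).days, 0)

# Day offsets between CURRENT_DATE and each date in the range.
offsets = np.arange(delta) + shift
value = np.exp(-0.001 * offsets).sum()

print(delta, shift, offsets[0], offsets[-1])  # 84 31 31 114
print(value)  # should agree with the 78.148490 shown in the table
```

The offsets run from 31 to 114, which are the same diff.days values the original loop visits, only in the opposite order.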
If you compare performance, you see the efficiency gains from saving on loops:
df_test = pd.concat([df for _ in range(100000)])

def transform1(df):
    df = df[df.CURRENT_DATE > df.START_DATE]
    delta = df[['END_DATE', 'CURRENT_DATE']].min(axis=1).subtract(df.START_DATE).dt.days.add(1)
    shift = df.CURRENT_DATE.subtract(df.END_DATE).dt.days.clip(lower=0)
    return [np.sum(np.exp(-0.001 * (np.arange(d) + s))) for d, s in zip(delta.values, shift.values)]

%timeit transform1(df_test)
1 loop, best of 3: 4.99 s per loop

def transform2(df):
    df['end'] = [d.days for d in df.CURRENT_DATE - df.START_DATE]
    df['start'] = (df.end - [max(0, d.days + 1) for d in (df.END_DATE.where(df.CURRENT_DATE > df.END_DATE, df.CURRENT_DATE) - df.START_DATE)])
    df['number_of_days'] = [sum(np.exp(-0.001 * i) for i in np.arange(stop, start, -1)) for start, stop in zip(df.start, df.end)]
    df.drop(['start', 'end'], axis=1, inplace=True)

%timeit transform2(df_test)
1 loop, best of 3: 36.7 s per loop
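The list comprehension in transform1 still builds one arange per row in Python. Because the terms exp(-0.001 * k) over consecutive integers k form a geometric series, the per-row sum also has a closed form, which removes that remaining loop. The following is a sketch of that idea, not part of the answer above: the function name decay_sum_closed_form and the names rate, last and n_terms are hypothetical, and it has not been timed here.

```python
import numpy as np
import pandas as pd


def decay_sum_closed_form(df, rate=0.001):
    """Per-row sum of exp(-rate * k) for k = shift .. shift + n_terms - 1."""
    # Last date that contributes a term: END_DATE or CURRENT_DATE, whichever is earlier.
    last = df["END_DATE"].where(df["END_DATE"] < df["CURRENT_DATE"], df["CURRENT_DATE"])
    n_terms = (last - df["START_DATE"]).dt.days + 1
    # Rows with CURRENT_DATE <= START_DATE (or an empty range) get zero terms,
    # mirroring the else branch of the question's transform().
    n_terms = n_terms.where(df["CURRENT_DATE"] > df["START_DATE"], 0).clip(lower=0)
    shift = (df["CURRENT_DATE"] - df["END_DATE"]).dt.days.clip(lower=0)
    # Geometric series with q = exp(-rate):
    #   sum_{k=s}^{s+n-1} q**k = q**s * (q**n - 1) / (q - 1)
    return np.exp(-rate * shift) * np.expm1(-rate * n_terms) / np.expm1(-rate)


# Sample data from the question
df = pd.DataFrame(
    {
        "ID": [1, 2, 3],
        "START_DATE": pd.to_datetime(["2010-12-08", "2010-12-10", "2010-12-16"]),
        "END_DATE": pd.to_datetime(["2011-03-01", "2011-01-12", "2011-03-07"]),
        "CURRENT_DATE": pd.to_datetime(["2011-04-01", "2011-01-02", "2010-10-10"]),
    }
)
df["number_of_days"] = decay_sum_closed_form(df)
print(df)
```

Unlike transform1, this version keeps the rows where CURRENT_DATE is not after START_DATE and assigns them 0 instead of filtering them out, so the result aligns with the original index.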
You want to get the start and end (integers) for each date range. Then it is relatively easy to vectorize the number_of_days calculation.
df['end'] = [d.days for d in df.CURRENT_DATE - df.START_DATE]
df['start'] = (
    df.end - [max(0, d.days + 1)
              for d in (df.END_DATE.where(df.CURRENT_DATE > df.END_DATE, df.CURRENT_DATE)
                        - df.START_DATE)])
df['number_of_days'] = [sum(np.exp(-0.001 * i) for i in np.arange(stop, start, -1))
                        for start, stop in zip(df.start, df.end)]
df.drop(['start', 'end'], axis=1, inplace=True)
>>> df
ID START_DATE END_DATE CURRENT_DATE number_of_days
0 1 2010-12-08 2011-03-01 2011-04-01 78.148490
1 2 2010-12-10 2011-01-12 2011-01-02 23.726149
2 3 2010-12-16 2011-03-07 2010-10-10 0.000000
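Before running a vectorized version on millions of rows, it may be worth checking it against the row-wise logic on a small sample. The sketch below is one way to do that. The helper names reference, start_end_sum and check are hypothetical: reference is a compact restatement of the question's loop (assumed equivalent), and start_end_sum repackages the start/end approach above as a function that returns a list instead of modifying df.

```python
import math
from datetime import timedelta

import numpy as np
import pandas as pd


def reference(start, end, current, rate=0.001):
    """Compact restatement of the question's row-wise loop."""
    if current <= start:
        return 0.0
    last = min(end, current)  # stop at END_DATE or CURRENT_DATE, whichever is earlier
    return sum(
        math.exp(-rate * (current - (start + timedelta(days=i))).days)
        for i in range((last - start).days + 1)
    )


def start_end_sum(df, rate=0.001):
    """Start/end approach: sum exp(-rate * i) over i = stop, stop - 1, ..., start + 1."""
    stop = (df["CURRENT_DATE"] - df["START_DATE"]).dt.days
    last = df["END_DATE"].where(df["CURRENT_DATE"] > df["END_DATE"], df["CURRENT_DATE"])
    n_terms = ((last - df["START_DATE"]).dt.days + 1).clip(lower=0)
    start = stop - n_terms
    return [np.exp(-rate * np.arange(b, a, -1)).sum() for a, b in zip(start, stop)]


def check(df, candidate, atol=1e-9):
    """True if `candidate` matches the row-wise reference on every row of df."""
    expected = [
        reference(s, e, c)
        for s, e, c in zip(df["START_DATE"], df["END_DATE"], df["CURRENT_DATE"])
    ]
    return bool(np.allclose(np.asarray(candidate, dtype=float), expected, atol=atol))


# Sample data from the question
df = pd.DataFrame(
    {
        "ID": [1, 2, 3],
        "START_DATE": pd.to_datetime(["2010-12-08", "2010-12-10", "2010-12-16"]),
        "END_DATE": pd.to_datetime(["2011-03-01", "2011-01-12", "2011-03-07"]),
        "CURRENT_DATE": pd.to_datetime(["2011-04-01", "2011-01-02", "2010-10-10"]),
    }
)
print(check(df, start_end_sum(df)))
```

One boundary case to keep in mind: when CURRENT_DATE equals START_DATE, the question's function returns 0, whereas the start/end approach appears to produce a single term of 1.0, so a sample containing such rows would be reported as a mismatch.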