
Vectorization of date range function for transformation of pandas dataframe

The problem is to transform a date range into a numerical value based on the current date.

Input table:

   ID   START_DATE  END_DATE    CURRENT_DATE
    1   2010-12-08  2011-03-01  2011-04-01
    2   2010-12-10  2011-01-12  2011-01-02
    3   2010-12-16  2011-03-07  2010-10-10

Output table:

   ID   START_DATE  END_DATE    CURRENT_DATE    number_of_days
   1    2010-12-08  2011-03-01  2011-04-01      78.148490
   2    2010-12-10  2011-01-12  2011-01-02      23.726149
   3    2010-12-16  2011-03-07  2010-10-10      0.000000

where number_of_days is computed from an exponential decay function, evaluated once per day in the range and then summed over the row.
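Written out, the per-row quantity looks like this. The symbols s, e, c (start, end and current date, counted in whole days) and n are notation introduced here for illustration; the formula is read off the reference function in this question:

```latex
\text{number\_of\_days} =
\begin{cases}
\displaystyle\sum_{i=0}^{n} e^{-0.001\,(c - s - i)}, \quad n = \min(e, c) - s, & c > s \\[6pt]
0, & c \le s
\end{cases}
```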

We can implement a function as follows:

import math
from datetime import timedelta as td

def transform(start, end, current):
    value = 0
    if current > end: #current date is later than the end date
        delta = end - start 
        for i in range(delta.days + 1):
            diff = current - (start + td(days = i))
            value += math.exp(- 0.001 * diff.days)
    elif current > start: #current date is between the start and end
        delta = current - start
        for i in range(delta.days + 1):
            diff = current - (start + td(days = i))
            value += math.exp(-0.001 * diff.days)
    else:
        pass
    return value

and then apply the transformation below:

df['number_of_days'] = df.apply(lambda x: transform(x['START_DATE'], x['END_DATE'], x['CURRENT_DATE']),axis=1)
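For a quick check, the sample rows can be rebuilt and pushed through the same `apply` call. This is only a minimal sketch: the DataFrame construction is assumed (the question does not show it), and `transform` here is a condensed rewrite of the function above that should give the same values, not the original code.

```python
import math
from datetime import timedelta as td

import pandas as pd


def transform(start, end, current):
    # Condensed rewrite of the question's function: both branches
    # sum over the days from start to min(end, current).
    if current <= start:
        return 0.0
    n_days = (min(end, current) - start).days
    return sum(
        math.exp(-0.001 * (current - (start + td(days=i))).days)
        for i in range(n_days + 1)
    )


# Assumed construction of the input table from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "START_DATE": pd.to_datetime(["2010-12-08", "2010-12-10", "2010-12-16"]),
    "END_DATE": pd.to_datetime(["2011-03-01", "2011-01-12", "2011-03-07"]),
    "CURRENT_DATE": pd.to_datetime(["2011-04-01", "2011-01-02", "2010-10-10"]),
})

df["number_of_days"] = df.apply(
    lambda x: transform(x["START_DATE"], x["END_DATE"], x["CURRENT_DATE"]),
    axis=1,
)
print(df)
```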

However, the apply approach is very slow for a table with millions of rows and long date ranges.

Any ideas on how to speed this up by vectorizing the inner for loop in the transformation function?

Thank you!

You could vectorize the exponential decay calculation using numpy array functions.

df = df[df.CURRENT_DATE > df.START_DATE] # just focusing on cases with calculation

Get the relevant delta (the number of days to sum over), depending on CURRENT_DATE and END_DATE:

delta = df[['END_DATE', 'CURRENT_DATE']].min(axis=1).subtract(df.START_DATE).dt.days.add(1)

Calculate the shift applied to arange() for the exponential decay as the difference between CURRENT_DATE and END_DATE, or 0 if that difference is negative:

shift = df.CURRENT_DATE.subtract(df.END_DATE).dt.days.clip(lower=0)
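To see what these two quantities look like, here is a small sketch on the sample rows. The DataFrame construction is an assumption (it is not shown in the thread); the filter, `delta` and `shift` expressions are the ones from this answer.

```python
import pandas as pd

# Assumed construction of the sample table, indexed by ID.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "START_DATE": pd.to_datetime(["2010-12-08", "2010-12-10", "2010-12-16"]),
    "END_DATE": pd.to_datetime(["2011-03-01", "2011-01-12", "2011-03-07"]),
    "CURRENT_DATE": pd.to_datetime(["2011-04-01", "2011-01-02", "2010-10-10"]),
}).set_index("ID")

# Keep only rows where there is something to sum (drops ID 3).
df = df[df.CURRENT_DATE > df.START_DATE]

# Number of days in the sum: from START_DATE to min(END_DATE, CURRENT_DATE), inclusive.
delta = df[["END_DATE", "CURRENT_DATE"]].min(axis=1).subtract(df.START_DATE).dt.days.add(1)

# Days elapsed between END_DATE and CURRENT_DATE, floored at 0.
shift = df.CURRENT_DATE.subtract(df.END_DATE).dt.days.clip(lower=0)

print(delta.tolist())  # [84, 24]
print(shift.tolist())  # [31, 0]
```

So row 1 sums 84 terms whose exponents start 31 days back, and row 2 sums 24 terms starting at 0.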

Produce and process the (shifted) arange objects using np.exp() and np.sum():

df['number_of_days'] = [np.sum(np.exp(-0.001 * (np.arange(d) + s))) for d, s in zip(delta.values, shift.values)]

to get:

   START_DATE   END_DATE CURRENT_DATE  number_of_days
ID                                                   
1  2010-12-08 2011-03-01   2011-04-01       78.148490
2  2010-12-10 2011-01-12   2011-01-02       23.726149

If you compare the performance of this numpy version (transform1) against a version that sums term by term in Python (transform2), you see the efficiency gains from cutting down on Python-level loops:

df_test = pd.concat([df for _ in range(100000)])

def transform1(df):
    df = df[df.CURRENT_DATE > df.START_DATE]
    delta = df[['END_DATE', 'CURRENT_DATE']].min(axis=1).subtract(df.START_DATE).dt.days.add(1)
    shift = df.CURRENT_DATE.subtract(df.END_DATE).dt.days.clip(lower=0)
    return [np.sum(np.exp(-0.001 * (np.arange(d) + s))) for d, s in zip(delta.values, shift.values)]

%timeit transform1(df_test)
1 loop, best of 3: 4.99 s per loop

def transform2(df):
    df['end'] = [d.days for d in df.CURRENT_DATE - df.START_DATE]
    df['start'] = (df.end - [max(0, d.days + 1) for d in (df.END_DATE.where(df.CURRENT_DATE > df.END_DATE, df.CURRENT_DATE) - df.START_DATE)])
    df['number_of_days'] = [sum(np.exp(-0.001 * i) for i in np.arange(stop, start, -1)) for start, stop in zip(df.start, df.end)]
    df.drop(['start', 'end'], axis=1, inplace=True)

%timeit transform2(df_test)
1 loop, best of 3: 36.7 s per loop
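One possible further step, which is not part of either answer in this thread: the per-row sum is a finite geometric series, so with `d` terms starting at offset `s` it should equal exp(-r*s) * (1 - exp(-r*d)) / (1 - exp(-r)). That removes the per-row arange entirely. The sketch below assumes the same column names as the question; the function name `number_of_days_closed_form` and the constant `RATE` are made up for illustration, and it has not been benchmarked here.

```python
import numpy as np
import pandas as pd

RATE = 0.001  # decay rate used in the question


def number_of_days_closed_form(df):
    """Hypothetical helper: closed-form geometric sum, no per-row loop."""
    start, end, cur = df["START_DATE"], df["END_DATE"], df["CURRENT_DATE"]
    active = cur > start                       # rows with a non-zero result
    last = end.where(end < cur, cur)           # element-wise min(end, current)
    d = (last - start).dt.days + 1             # number of terms in the sum
    s = (cur - end).dt.days.clip(lower=0)      # offset of the first exponent
    # expm1(x) = exp(x) - 1; both expm1 values are negative, so the ratio is positive.
    total = np.exp(-RATE * s) * np.expm1(-RATE * d) / np.expm1(-RATE)
    return total.where(active, 0.0)


# Assumed construction of the sample table from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "START_DATE": pd.to_datetime(["2010-12-08", "2010-12-10", "2010-12-16"]),
    "END_DATE": pd.to_datetime(["2011-03-01", "2011-01-12", "2011-03-07"]),
    "CURRENT_DATE": pd.to_datetime(["2011-04-01", "2011-01-02", "2010-10-10"]),
})
result = number_of_days_closed_form(df)
print(result)
```

On the three sample rows this should reproduce the values in the output table, and it keeps the row with a zero result instead of filtering it out.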

You want to get the start and end (as integers) for each date range. Then it is relatively easy to vectorize the number_of_days calculation.

df['end'] = [d.days for d in df.CURRENT_DATE - df.START_DATE]
df['start'] = (
    df.end - [max(0, d.days + 1) 
              for d in (df.END_DATE.where(df.CURRENT_DATE > df.END_DATE, df.CURRENT_DATE) 
                        - df.START_DATE)])

df['number_of_days'] = [sum(np.exp(-0.001 * i) for i in np.arange(stop, start, -1)) 
                        for start, stop in zip(df.start, df.end)]
df.drop(['start', 'end'], axis=1, inplace=True)

>>> df
   ID START_DATE   END_DATE CURRENT_DATE  number_of_days
0   1 2010-12-08 2011-03-01   2011-04-01       78.148490
1   2 2010-12-10 2011-01-12   2011-01-02       23.726149
2   3 2010-12-16 2011-03-07   2010-10-10        0.000000
