Vectorized implementation to create multiple rows from a single row in pandas dataframe

For each row in the input table, I need to generate multiple rows by splitting the date range at month boundaries (please refer to the sample output below).

There is a simple iterative approach that converts row by row, but it is very slow on large dataframes.

Could anyone suggest a vectorized approach, such as using apply(), map(), etc., to achieve this?

The output table is a new table.

Input:

ID, START_DATE, END_DATE
1, 2010-12-08, 2011-03-01
2, 2010-12-10, 2011-01-12
3, 2010-12-16, 2011-03-07

Output:

ID, START_DATE, END_DATE, NUMBER_DAYS, ACTION_DATE
1, 2010-12-08, 2010-12-31, 23, 201012
1, 2010-12-08, 2011-01-31, 54, 201101
1, 2010-12-08, 2011-02-28, 82, 201102
1, 2010-12-08, 2011-03-01, 83, 201103
2, 2010-12-10, 2010-12-31, 21, 201012
2, 2010-12-10, 2011-01-12, 33, 201101
3, 2010-12-16, 2010-12-31, 15, 201012
4, 2010-12-16, 2011-01-31, 46, 201101
5, 2010-12-16, 2011-02-28, 74, 201102
6, 2010-12-16, 2011-03-07, 81, 201103
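
As a sanity check on the expected output: NUMBER_DAYS looks like END_DATE minus START_DATE in days, and ACTION_DATE like the year and month of that row's END_DATE. This reading of the columns is an assumption, not something the question states. A minimal sketch for ID 2, with the two end dates copied by hand from the table above:

```python
import pandas as pd

start = pd.Timestamp("2010-12-10")
# Month-end cut point inside the range, followed by the real end date
# (both copied by hand from the sample output for ID 2).
ends = [pd.Timestamp("2010-12-31"), pd.Timestamp("2011-01-12")]

# (NUMBER_DAYS, ACTION_DATE) for each output row
rows = [((end - start).days, end.strftime("%Y%m")) for end in ends]
print(rows)
```

The two tuples should match the NUMBER_DAYS and ACTION_DATE values listed for ID 2.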

I think you can use:

import pandas as pd

df = pd.DataFrame({'ID': {0: 1, 1: 2, 2: 3}, 
'END_DATE': {0: pd.Timestamp('2011-03-01 00:00:00'),
             1: pd.Timestamp('2011-01-12 00:00:00'), 
             2: pd.Timestamp('2011-03-07 00:00:00')}, 
'START_DATE': {0: pd.Timestamp('2010-12-08 00:00:00'), 
               1: pd.Timestamp('2010-12-10 00:00:00'), 
               2: pd.Timestamp('2010-12-16 00:00:00')}}, 
columns=['ID','START_DATE', 'END_DATE'])

print(df)
   ID START_DATE   END_DATE
0   1 2010-12-08 2011-03-01
1   2 2010-12-10 2011-01-12
2   3 2010-12-16 2011-03-07

#if multiple columns, you can filter them by subset
#df = df[['ID','START_DATE', 'END_DATE']]

#stack columns START_DATE and END_DATE
df1 = (df.set_index('ID')
         .stack()
         .reset_index(level=1, drop=True)
         .to_frame()
         .rename(columns={0:'Date'}))
#print(df1)

#resample and fill missing data 
#note: on pandas >= 2.2 the month-end alias is 'ME' ('M' is deprecated there)
df1 = (df1.groupby(df1.index)
          .apply(lambda x: x.set_index('Date').resample('M').asfreq())
          .reset_index())
print(df1)

   ID       Date
0   1 2010-12-31
1   1 2011-01-31
2   1 2011-02-28
3   1 2011-03-31
4   2 2010-12-31
5   2 2011-01-31
6   3 2010-12-31
7   3 2011-01-31
8   3 2011-02-28
9   3 2011-03-31
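
To see where these month-end dates come from, the same dates can be reproduced for a single ID without resample. This is only an illustrative sketch using pd.period_range, a swapped-in technique rather than the method used above; the variable names are made up for the example.

```python
import pandas as pd

# Date range of ID 2 from the question
start, end = pd.Timestamp("2010-12-10"), pd.Timestamp("2011-01-12")

# One monthly period per calendar month touched by [start, end]
months = pd.period_range(start=start, end=end, freq="M")

# Last calendar day of each of those months (time of day stripped)
month_ends = months.to_timestamp(how="end").normalize()
print(month_ends)

# The final month end lies after the real end date
print(month_ends[-1] > end)
```

The last month end falls after the real END_DATE, so the final row of each ID needs correcting.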

There is a problem with the last day of the month: resample returns month-end dates, so the last row for each ID does not match the real END_DATE. To fix this, first create period columns and then merge on them. Then use combine_first to fill the missing END_DATE values from column Date, and bfill to fill the missing values of column START_DATE.

df['period'] = df.END_DATE.dt.to_period('M')
df1['period'] = df1.Date.dt.to_period('M')

df2 = pd.merge(df1, df, on=['ID','period'], how='left')

df2['END_DATE'] = df2.END_DATE.combine_first(df2.Date)
df2['START_DATE'] = df2.START_DATE.bfill()
df2 = df2.drop(['Date','period'], axis=1)
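
To make the role of combine_first and bfill more concrete, here is a small sketch of the intermediate merge result for ID 2 only. The month-end rows in df1 are typed in by hand rather than produced by resample, and the names merged and filled are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({"ID": [2],
                   "START_DATE": pd.to_datetime(["2010-12-10"]),
                   "END_DATE": pd.to_datetime(["2011-01-12"])})
# Month-end rows for this ID, entered by hand for the example
df1 = pd.DataFrame({"ID": [2, 2],
                    "Date": pd.to_datetime(["2010-12-31", "2011-01-31"])})

df["period"] = df["END_DATE"].dt.to_period("M")
df1["period"] = df1["Date"].dt.to_period("M")

merged = pd.merge(df1, df, on=["ID", "period"], how="left")
# Only the row whose month equals END_DATE's month finds a match;
# the earlier month has NaT in START_DATE and END_DATE.
print(merged[["Date", "START_DATE", "END_DATE"]])

filled = merged.assign(
    # keep the real END_DATE where matched, else fall back to the month end
    END_DATE=merged["END_DATE"].combine_first(merged["Date"]),
    # copy START_DATE backwards from the matched (last) row of the ID
    START_DATE=merged["START_DATE"].bfill(),
)
print(filled[["START_DATE", "END_DATE"]])
```

Note that bfill relies on the row order: the matched row is the last row of each ID, so filling backwards never crosses into another ID.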

Finally, add the new columns: the day difference with dt.days and the year-month label with dt.strftime:

df2['NUMBER_DAYS'] = (df2.END_DATE - df2.START_DATE).dt.days
df2['ACTION_DATE'] = df2.END_DATE.dt.strftime('%Y%m')

print(df2)
   ID START_DATE   END_DATE  NUMBER_DAYS ACTION_DATE
0   1 2010-12-08 2010-12-31           23      201012
1   1 2010-12-08 2011-01-31           54      201101
2   1 2010-12-08 2011-02-28           82      201102
3   1 2010-12-08 2011-03-01           83      201103
4   2 2010-12-10 2010-12-31           21      201012
5   2 2010-12-10 2011-01-12           33      201101
6   3 2010-12-16 2010-12-31           15      201012
7   3 2010-12-16 2011-01-31           46      201101
8   3 2010-12-16 2011-02-28           74      201102
9   3 2010-12-16 2011-03-07           81      201103

You can also try this, using the pandas date_range function together with DataFrame apply.

In your output, the IDs after 3 are listed as 4, 5, 6. I believe they should all be 3. Please check.

import pandas as pd
from datetime import datetime

l_ret_df = pd.DataFrame(columns=('ID', 'START_DATE', 'END_DATE', 'NUMBER_DAYS', 'ACTION_DATE'))

def generate_ts_df(p_row):
    l_id = p_row['ID']
    l_start = p_row['START_DATE']
    l_start_date = datetime.strptime(l_start,'%Y-%m-%d')
    l_end = p_row['END_DATE']
    l_end_date = datetime.strptime(l_end,'%Y-%m-%d')
    # month-end dates between start and end; on pandas >= 2.2 use freq='ME'
    l_df = pd.date_range(start=l_start,end=l_end,freq='M')
    global l_ret_df

    for e in l_df:
        l_ret_df = l_ret_df.append(pd.DataFrame([[l_id,l_start,e.date(),(e.date()-l_start_date.date()).days,e.strftime('%Y%m')]],columns=('ID', 'START_DATE', 'END_DATE', 'NUMBER_DAYS', 'ACTION_DATE')))
    l_ret_df = l_ret_df.append(pd.DataFrame([[l_id,l_start,l_end,(l_end_date.date()-l_start_date.date()).days,l_end_date.strftime('%Y%m')]],columns=('ID', 'START_DATE', 'END_DATE', 'NUMBER_DAYS', 'ACTION_DATE')))
    return 1

if __name__ == "__main__":
    l_ts_base = pd.DataFrame([[1, '2010-12-08', '2011-03-01'],
                            [2, '2010-12-10', '2011-01-12'],
                            [3, '2010-12-16', '2011-03-07']], columns=('ID', 'START_DATE', 'END_DATE'))

    l_ts_base.apply(generate_ts_df, axis=1)
    print(l_ret_df)

Output:

   ID  START_DATE    END_DATE  NUMBER_DAYS ACTION_DATE
0   1  2010-12-08  2010-12-31           23      201012
0   1  2010-12-08  2011-01-31           54      201101
0   1  2010-12-08  2011-02-28           82      201102
0   1  2010-12-08  2011-03-01           83      201103
0   2  2010-12-10  2010-12-31           21      201012
0   2  2010-12-10  2011-01-12           33      201101
0   3  2010-12-16  2010-12-31           15      201012
0   3  2010-12-16  2011-01-31           46      201101
0   3  2010-12-16  2011-02-28           74      201102
0   3  2010-12-16  2011-03-07           81      201103
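
Both approaches above still run Python code once per ID or per row (groupby(...).apply and DataFrame.apply). If that is still too slow, a loop-free variant may be worth trying. The following is only a sketch and not part of either answer: it assumes START_DATE and END_DATE are already datetime columns with END_DATE >= START_DATE, and names such as n_months, offset and month_end are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3],
    "START_DATE": pd.to_datetime(["2010-12-08", "2010-12-10", "2010-12-16"]),
    "END_DATE": pd.to_datetime(["2011-03-01", "2011-01-12", "2011-03-07"]),
})

# Number of calendar months each [START_DATE, END_DATE] range touches
start_m = df["START_DATE"].dt.year * 12 + df["START_DATE"].dt.month
end_m = df["END_DATE"].dt.year * 12 + df["END_DATE"].dt.month
n_months = (end_m - start_m + 1).to_numpy()

# Repeat each input row once per month it touches
out = df.iloc[np.repeat(np.arange(len(df)), n_months)].reset_index(drop=True)

# Month offset within each original row: 0, 1, 2, ...
row_starts = np.cumsum(n_months) - n_months
offset = np.arange(n_months.sum()) - np.repeat(row_starts, n_months)

# Last day of the month lying `offset` months after START_DATE's month
first_month = out["START_DATE"].to_numpy().astype("datetime64[M]")
next_month = first_month + (offset + 1).astype("timedelta64[M]")
month_end = next_month.astype("datetime64[D]") - np.timedelta64(1, "D")

# Cap at the real END_DATE so the final row of each ID keeps its true end
real_end = out["END_DATE"].to_numpy()
out["END_DATE"] = np.minimum(month_end.astype(real_end.dtype), real_end)

out["NUMBER_DAYS"] = (out["END_DATE"] - out["START_DATE"]).dt.days
out["ACTION_DATE"] = out["END_DATE"].dt.strftime("%Y%m")
print(out)
```

The idea is to compute how many rows each input row expands to, repeat the rows in one step, and derive every month end with array arithmetic instead of calling resample or date_range per row.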
