Specific aggregations for data in weekly timeframes in Python (Pandas)

I am working on a problem and am having trouble finding the best way to solve it; maybe you can help me.

I have a dataset of calls from a customer relationship call center, and I need to aggregate it in a specific way. The company is investigating the behavior of new customers and believes they tend to call more often than existing customers (which is expected, but I need to show it visually). For clients who joined the company in a given period, I need to know how many of the calls received by the call center came from those clients in that same period and in each subsequent period.

Basically, of the clients who subscribed in week 1, how many called in week 1, how many called in week 2, and so forth.
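To make the target concrete, here is a minimal sketch of the kind of table described above, built on a small made-up dataset. The column names account_id, cred_date and date_ref follow the ones used in this question; the values and the helper columns cohort_week and weeks_since_cred are assumptions for illustration only.

```python
import pandas as pd

# Toy data (made up for illustration): one row per call.
calls = pd.DataFrame({
    'account_id': [1, 1, 2, 2, 3, 3],
    'cred_date': pd.to_datetime(['2021-01-04', '2021-01-04', '2021-01-05',
                                 '2021-01-05', '2021-01-12', '2021-01-12']),
    'date_ref': pd.to_datetime(['2021-01-05', '2021-01-13', '2021-01-06',
                                '2021-01-07', '2021-01-12', '2021-01-27']),
})

# Cohort = Monday of the week in which the client subscribed (hypothetical helper column).
calls['cohort_week'] = calls['cred_date'].dt.to_period('W').dt.start_time

# Whole weeks elapsed between subscription and call (0 = first 7 days).
calls['weeks_since_cred'] = (calls['date_ref'] - calls['cred_date']).dt.days // 7

# Distinct callers per subscription cohort and per week since subscription.
cohort_table = (calls.groupby(['cohort_week', 'weeks_since_cred'])['account_id']
                     .nunique()
                     .unstack(fill_value=0))
print(cohort_table)
```

Each row is a subscription week and each column is the number of weeks since subscription, so the first column counts clients who called during their first 7 days.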

Here's a sneak peek of the dataset:

[Image: basics of the dataset]

date_ref is the day of the call. cred_date is the date of the subscription.

I came up with a solution using boolean indexing in pandas and, boy, does the code look ugly. I am also not confident that it is reliable. Here's what I have done so far:

import pandas as pd

# Aggregation functions to be passed to groupby
aggregation = {
    'n_creds': ('account_id', pd.Series.nunique),
    'n_calls': ('date_ref', 'count')
}

# Groupby splitting dates in weeks and with specified aggregations
mcases_agg = mcases_ready.groupby(pd.Grouper(key = 'cred_date', freq = 'W')).agg(**aggregation)
mcases_idx_list = mcases_agg.index.tolist()

n_calls_list = []
for i, _ in enumerate(mcases_idx_list):
    if i == 0:
        df = mcases[mcases['cred_date'] <= mcases_idx_list[i]]
        n_calls_from_cred_this_week = df[(df['date_ref'] >= mcases_idx_list[i]) &
                                         (df['date_ref'] < (mcases_idx_list[i + 1]))]['account_id'].nunique()
        n_calls_list.append(n_calls_from_cred_this_week)
    
    elif i != len(mcases_idx_list) - 1:
        df = mcases[mcases['cred_date'] <= mcases_idx_list[i]]
        n_calls_from_cred_this_week = df[(df['date_ref'] >= mcases_idx_list[i]) &
                                         (df['date_ref'] < (mcases_idx_list[i + 1]))]['account_id'].nunique()
        n_calls_list.append(n_calls_from_cred_this_week)
    
    else:    
        df = mcases[mcases['cred_date'] <= mcases_idx_list[i]]
        n_calls_from_cred_this_week = df[(df['date_ref'] >= mcases_idx_list[i])]['account_id'].nunique()
        n_calls_list.append(n_calls_from_cred_this_week)
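One detail worth double-checking in the code above is how pd.Grouper(freq = 'W') labels its bins. With the default settings, 'W' means weeks ending on Sunday, and each group is labeled with that closing Sunday rather than with the start of the week. The sketch below shows this on made-up dates; the frame subs and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy subscription dates (made up); 2021-01-10 and 2021-01-17 are Sundays.
subs = pd.DataFrame({
    'account_id': [1, 2, 3],
    'cred_date': pd.to_datetime(['2021-01-04', '2021-01-10', '2021-01-11']),
})

# Weekly bins: each label is the Sunday that closes the week.
weekly = subs.groupby(pd.Grouper(key='cred_date', freq='W'))['account_id'].nunique()
print(weekly)
```

Because the labels are week ends, a window such as "from label i up to label i + 1" covers the week after the one that label i names, which may not be the window intended.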

I would like to hear from the community whether you have faced a similar problem and how you solved it. If you haven't, please share your suggestions for a more straight-to-the-point implementation, perhaps using a tool I am not familiar with.

Thanks!

After racking my brain a little, I came up with a much better solution to my problem (proof that a tired mind has limited resources).

  1. First, I computed the number of days between each call (date_ref) and the subscription date (cred_date) and saved it to a new column, interval_cred_call:

    mcases['interval_cred_call'] = (mcases['date_ref'] - mcases['cred_date']).dt.days

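  2. From this value, I created one column per weekly time span, flagging each call with a boolean: calls made less than 7 days after the subscription go to the column Week #1, calls made between 7 and 14 days after it go to the column Week #2, and so forth. (In the code below the columns are named Semana #1, Semana #2, and so on; "semana" is Portuguese for "week".)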

  3. Then I automated the task by iterating over the weeks inside a function, which can return either a pd.DataFrame or a horizontal bar plot.
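As a rough illustration of the three steps above, here is a compact sketch on made-up data. It is a swapped-in variant, not the function in this answer: it stores a single integer week number per call instead of one boolean column per week. The values, the column week_number and the variable n_weeks are assumptions for illustration only.

```python
import pandas as pd

# Toy data (made up for illustration): one row per call.
mcases = pd.DataFrame({
    'cred_date': pd.to_datetime(['2021-01-04'] * 4),
    'date_ref': pd.to_datetime(['2021-01-04', '2021-01-10', '2021-01-11', '2021-01-20']),
})

# Step 1: days between subscription and call.
mcases['interval_cred_call'] = (mcases['date_ref'] - mcases['cred_date']).dt.days

# Step 2: week number since subscription (days 0-6 -> 1, days 7-13 -> 2, ...).
mcases['week_number'] = mcases['interval_cred_call'] // 7 + 1

# Step 3: calls per week, keeping weeks with no calls as zeros.
n_weeks = 3
calls_per_week = (mcases['week_number']
                  .value_counts()
                  .reindex(range(1, n_weeks + 1), fill_value=0))
calls_per_week.index = [f'Week #{w}' for w in calls_per_week.index]
print(calls_per_week)
```

Calling calls_per_week.plot(kind='barh') on the result would give a horizontal bar chart of the weekly totals.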

Below is the function I coded:

import matplotlib.pyplot as plt


def n_call_timewindow(df, n_days = 100, plot = False):
    '''
    Counts the calls (cases) made in each of the n_days // 7 weeks after entrance in the cred database and returns a pd.DataFrame whose columns are the weeks after entrance.
    
    Parameters:
 
    ** df:
    pd.DataFrame with one row per call and an interval_cred_call column (days between subscription and call).
    
    ** n_days (default, 100):
    Number of days to be included in the analysis. Note that the output is in weeks, so the number of days is divided by 7 and the remainder is discarded.
    
    ** plot (default, False):
    If set to True, plots a horizontal bar chart instead of returning the pd.DataFrame.
    '''
    df = df.copy()
    
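    week = n_days // 7
    # One 0/1 indicator column per week ("Semana" is Portuguese for "week").
    for i in range(0, week + 1):
        if i == 0:
            # Week 1: calls made less than 7 days after the subscription.
            df[f'Semana #{i + 1}'] = (df['interval_cred_call'] < 7).astype(int)
        else:
            # Week i + 1: calls made from day i * 7 up to, but not including, day (i + 1) * 7.
            df[f'Semana #{i + 1}'] = ((df['interval_cred_call'] >= (i * 7)) &
                                      (df['interval_cred_call'] < ((i + 1) * 7))).astype(int)

    # Keep only the weekly columns, dropping the last one, which extends past n_days.
    df = df.iloc[:, -(week + 1):-1]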
    
    if plot:
        fig, ax = plt.subplots(1, figsize = (15, 5))
        # Title (Portuguese): "Total calls made after subscription (per week, since entering the database)".
        fig.suptitle('Total de chamados realizados após Credenciamento\n(por semana, a partir da entrada na base)',
                     fontsize = 20, color = 'darkgreen')
        fig.tight_layout()
        df.sum().plot(kind = 'barh', ax = ax, alpha = .8, width = .8, color = 'limegreen', zorder = 2).invert_yaxis()
        # The flag is passed positionally: newer Matplotlib versions renamed the `b` keyword to `visible`.
        ax.grid(True, axis = 'x', alpha = .2, zorder = 1)
        ax.tick_params(axis='both', colors='gray', labelsize = 14, labelcolor = 'dimgray')
        ax.spines['left'].set_color('lightgray')
        ax.spines['bottom'].set_visible(False)
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)
        
        return f'{fig}'

    return df

And with the argument plot = True, this is the result:

[Image: resulting horizontal bar chart]
