
Feature engineering multiple columns of a pandas DataFrame (adding new columns based on existing ones)

Sorry if this is naive. I have the following data and I want to feature engineer some columns, but I don't know how to perform multiple operations on the same DataFrame. Note that I have multiple entries for each customer, so in the end I want aggregated values (i.e. one entry per customer).

    customer_id purchase_amount date_of_purchase    days_since
 0    760             25.0         06-11-2009             2395
 1    860             50.0         09-28-2012             1190
 2   1200             100.0        10-25-2005             3720
 3   1420             50.0         09-07-2009             2307
 4   1940             70.0         01-25-2013             1071

New columns based on min, count and mean:

customer_purchases['amount'] = customer_purchases.groupby(['customer_id'])['purchase_amount'].agg('min')
customer_purchases['frequency'] = customer_purchases.groupby(['customer_id'])['days_since'].agg('count')
customer_purchases['recency'] = customer_purchases.groupby(['customer_id'])['days_since'].agg('mean')

Expected outcome:

   customer_id  purchase_amount  date_of_purchase  days_since  recency  frequency      amount  first_purchase
0          760             25.0        06-11-2009        2395     1273          5   38.000000            3293
1          860             50.0        09-28-2012        1190      118         10   54.000000            3744
2         1200            100.0        10-25-2005        3720     1192          9  102.777778            3907
3         1420             50.0        09-07-2009        2307      142         34   51.029412            3825
4         1940             70.0        01-25-2013        1071      686         10   47.500000            3984

One solution:

I can think of running three separate operations, one for each needed column, and then joining the results to get a new DataFrame. I know this is not efficient, but it would be enough for what I need.

df_1 = customer_purchases.groupby('customer_id', sort = False)["purchase_amount"].min().reset_index(name ='amount')

df_2 = customer_purchases.groupby('customer_id', sort = False)["days_since"].count().reset_index(name ='frequency')

df_3 = customer_purchases.groupby('customer_id', sort = False)["days_since"].mean().reset_index(name ='recency')

However, I either get an error or a DataFrame with incorrect data. Your help and patience will be appreciated.

SOLUTION

I finally found the solution:

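import pandas as pd

def f(x):
    # x holds all rows belonging to a single customer_id
    recency        = x['days_since'].min()
    frequency      = x['days_since'].count()
    monetary_value = x['purchase_amount'].mean()
    # three labels, one per value (the original list merged two of them into one string)
    c = ['recency', 'frequency', 'monetary_value']
    return pd.Series([recency, frequency, monetary_value], index=c)

# one row per customer, one column per engineered feature
df1 = customer_purchases.groupby('customer_id').apply(f)
print(df1)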
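The same three features can probably also be computed without a custom function, using pandas named aggregation (available since pandas 0.25). The following is only a sketch: the sample rows are made up for illustration, and only the column names are taken from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
customer_purchases = pd.DataFrame({
    "customer_id": [760, 760, 860, 860, 860],
    "purchase_amount": [25.0, 51.0, 50.0, 60.0, 40.0],
    "days_since": [2395, 1273, 1190, 118, 500],
})

# Named aggregation: output_column=(input_column, aggregation_name)
rfm = (
    customer_purchases
    .groupby("customer_id")
    .agg(
        recency=("days_since", "min"),
        frequency=("days_since", "count"),
        monetary_value=("purchase_amount", "mean"),
    )
    .reset_index()  # turn customer_id back into a regular column
)
print(rfm)
```

This gives one row per customer in a single pass, which is what the three separate groupby calls plus a join were meant to achieve.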

Use transform instead:

customer_purchases.groupby('customer_id')['purchase_amount'].transform(lambda x : x.min())

transform returns one value for each row of the original DataFrame, whereas agg returns one row per group.
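A small sketch of the difference, using made-up sample rows (only the column names come from the question). The result of agg is indexed by customer_id, so assigning it to a column of the original frame aligns on the wrong labels and mostly yields NaN. The result of transform keeps the original row index, so it can be assigned directly.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
customer_purchases = pd.DataFrame({
    "customer_id": [760, 760, 860, 860, 860],
    "purchase_amount": [25.0, 51.0, 50.0, 60.0, 40.0],
    "days_since": [2395, 1273, 1190, 118, 500],
})

# agg collapses each group: the result has one entry per customer_id
per_customer = customer_purchases.groupby("customer_id")["purchase_amount"].agg("min")

# transform broadcasts the group value back to every original row
customer_purchases["amount"] = (
    customer_purchases.groupby("customer_id")["purchase_amount"].transform("min")
)
customer_purchases["frequency"] = (
    customer_purchases.groupby("customer_id")["days_since"].transform("count")
)
customer_purchases["recency"] = (
    customer_purchases.groupby("customer_id")["days_since"].transform("mean")
)

print(per_customer)
print(customer_purchases)
```

If only one row per customer is wanted in the end, agg is likely the better fit; transform is useful when the new columns should sit next to the original rows.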
