
What is the cleanest way to group by and conditionally transform data within a Pandas Dataframe?

I am using pandas 0.25.3 and Python 3.7.4. I am working with a DataFrame similar to df1 below. I need to transform the "Hours" and "Wages" fields into "Gross Hours", "Regular Wages", and "Overtime Wages" fields, conditionally on the value of the "Pay Code" field in the same DataFrame, and I also need to group by "Check Date".

import numpy as np
import pandas as pd

df1 = pd.DataFrame( {
                        "Pay Code" : ["1","4","OCH","3","3"],
                        "Check Date" : ["2019-01-04","2019-01-04","2019-01-04","2019-01-04","2019-01-18"],
                        "Pay Start Date" : ["2018-12-15","2018-12-15","2018-12-15","2018-12-15","2018-12-29"],
                        "Pay End Date" : ["2018-12-28","2018-12-28","2018-12-28","2018-12-28","2019-01-11"],
                        "Pay Code Description" : ["REGULAR PAY","HOLIDAY PAY","ON CALL HOURLY","VACATION PAY","VACATION PAY"],
                        "Hours" : [46.0,16.0,152.0,18.0,19.5],
                        "Wages" : [1226.58,426.64,63.33,479.98,530.38],
                        "Gross Hours" : [np.nan,np.nan,np.nan,np.nan,np.nan],
                        "Regular Wages" : [np.nan,np.nan,np.nan,np.nan,np.nan],
                        "Overtime Wages" : [np.nan,np.nan,np.nan,np.nan,np.nan]
                  } )

Let's say I have static lists used as a reference to determine which column the values should be transformed into.

GrossHours = ['1','2','3']

RegularWages = ['1','3','4']

OvertimeWages = ['2','OCH']

The desired result is this DataFrame:

df_result = pd.DataFrame( {
                        "Check Date" : ["2019-01-04","2019-01-18"],
                        "Pay Start Date" : ["2018-12-15","2018-12-29"],
                        "Pay End Date" : ["2018-12-28","2019-01-11"],
                        "Hours" : [232.0,19.5],
                        "Wages" : [2196.53,530.38],
                        "Gross Hours" : [64.0,19.5],
                        "Regular Wages" : [2133.2,530.38],
                        "Overtime Wages" : [63.33,np.nan]
                  } )
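(To verify the first row against the lists above: for the 2019-01-04 check, Gross Hours = 46.0 + 18.0 = 64.0 from pay codes 1 and 3; Regular Wages = 1226.58 + 426.64 + 479.98 = 2133.20 from codes 1, 4, and 3; Overtime Wages = 63.33 from code OCH; and "Hours" and "Wages" are the unconditional sums, 232.0 and 2196.53.)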

What have I tried? I've applied plenty of lambda functions to df1 that give me the results I want, but I'm not certain how to get those resulting objects back into the original DataFrame df1 cleanly. Is the only option to build a bunch of intermediary DataFrames that are then joined or merged back onto the original, which is then grouped by again?

g1 = df1.groupby(["Check Date"])

g1.apply(lambda x: x[x['Pay Code'].isin(GrossHours)]['Hours'].astype(float).sum())

Check Date
2019-01-04    64.0
2019-01-18    19.5
dtype: float64
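Purely to illustrate what I mean (this is a sketch, not code I actually ran, and conditional_sum is just a hypothetical helper name), the per-list Series could in principle be stacked side by side with pd.concat, assuming df1 and the three lists defined above:

def conditional_sum(frame, codes, value_col):
    # Sum value_col per Check Date, restricted to rows whose Pay Code is in codes
    return frame.loc[frame['Pay Code'].isin(codes)].groupby('Check Date')[value_col].sum()

sketch = pd.concat(
    {
        'Gross Hours': conditional_sum(df1, GrossHours, 'Hours'),
        'Regular Wages': conditional_sum(df1, RegularWages, 'Wages'),
        'Overtime Wages': conditional_sum(df1, OvertimeWages, 'Wages'),
    },
    axis=1,
).reset_index()

Because pd.concat aligns on the "Check Date" index, the 2019-01-18 group simply gets NaN for "Overtime Wages".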

Here is what I ended up doing. First, I set up a list of tuples to iterate over.

transformations = [('Gross_Hours', ['1','2','3']), ('Regular_Wages', ['1','3','4']), ('Overtime_Wages', ['2','OCH'])]

I also defined the structure of the output dataframe I expect.

result_dataframe_fields = ['Check Date', 'Pay Start Date','Pay End Date','Gross Hours', 'Regular Wages', 'Overtime Wages']

By applying a suggestion from @Datanovice, which was similar to the path I was already on, I ended up with the following, which is about as clean and readable as I can get it.

# Instantiate the result dataframe
df_result = df1.groupby(result_dataframe_fields).sum().reset_index()

for t_ix, t_list in transformations:
    # Create aggregated set to populate result dataframe
    if t_ix == 'Gross_Hours':
        g1 = df1.loc[df1['Pay Code'].isin(t_list)].groupby('Check Date')['Hours'].agg(temp_col_name='sum')
        g2 = g1.reset_index()
        g2.columns = ['Check Date', t_ix]
    else:
        g1 = df1.loc[df1['Pay Code'].isin(t_list)].groupby('Check Date')['Wages'].agg(temp_col_name='sum')
        g2 = g1.reset_index()
        g2.columns = ['Check Date', t_ix]

    # Handle the .agg() column naming limitation (no spaces allowed in named aggregation)
    colsg2 = g2.columns
    colsg2 = colsg2.map(lambda x: x.replace('_', ' ') if isinstance(x, (str)) else x)
    g2.columns = colsg2

    # Dataframe copy that will update result dataframe
    update_df = g2.copy()

    df_result.update(update_df)

(Result screenshot from JupyterLab omitted.)

I still hope this isn't the best possible answer; my actual application is far larger than this example, and the approach looks rather hideous blown up to "real code" scale.
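For comparison only, here is a more compact sketch of the same idea (I have not verified it at my real scale, and masked / compact_result are just illustrative names): the conditional columns are built up front with Series.where and then everything is summed in a single groupby, assuming df1 and the code lists from the question. min_count=1 keeps groups with no matching pay codes as NaN.

masked = df1.assign(**{
    'Gross Hours': df1['Hours'].where(df1['Pay Code'].isin(GrossHours)),
    'Regular Wages': df1['Wages'].where(df1['Pay Code'].isin(RegularWages)),
    'Overtime Wages': df1['Wages'].where(df1['Pay Code'].isin(OvertimeWages)),
})

compact_result = (
    masked
    .groupby(['Check Date', 'Pay Start Date', 'Pay End Date'], as_index=False)
    [['Hours', 'Wages', 'Gross Hours', 'Regular Wages', 'Overtime Wages']]
    .sum(min_count=1)  # leaves groups with no matching codes (e.g. no overtime) as NaN
)

This avoids the per-transformation loop and the .update() step, at the cost of spelling out each target column once in the assign call.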
