What is the cleanest way to group by and conditionally transform data within a Pandas DataFrame?

I am using pandas (0.25.3) and Python (3.7.4). I am working with a DataFrame similar to df1 below. I need to transform the "Hours" and "Wages" fields into the "Gross Hours", "Regular Wages", and "Overtime Wages" fields conditionally, based on the value of the "Pay Code" field in the same DataFrame. I also need to group by "Check Date".

import pandas as pd

df1 = pd.DataFrame( {
                        "Pay Code" : ["1","4","OCH","3","3"],
                        "Check Date" : ["2019-01-04","2019-01-04","2019-01-04","2019-01-04","2019-01-18"],
                        "Pay Start Date" : ["2018-12-15","2018-12-15","2018-12-15","2018-12-15","2018-12-29"],
                        "Pay End Date" : ["2018-12-28","2018-12-28","2018-12-28","2018-12-28","2019-01-11"],
                        "Pay Code Description" : ["REGULAR PAY","HOLIDAY PAY","ON CALL HOURLY","VACATION PAY","VACATION PAY"],
                        "Hours" : [46.0,16.0,152.0,18.0,19.5],
                        "Wages" : [1226.58,426.64,63.33,479.98,530.38],
                        # Placeholder columns to be filled in conditionally
                        "Gross Hours" : ["NaN","NaN","NaN","NaN","NaN"],
                        "Regular Wages" : ["NaN","NaN","NaN","NaN","NaN"],
                        "Overtime Wages" : ["NaN","NaN","NaN","NaN","NaN"]
                  } )

Let's say I have static lists used as a reference to determine which column the values should be transformed into.

GrossHours = ['1','2','3']

RegularWages = ['1','3','4']

OvertimeWages = ['2','OCH']

The desired result will be this DataFrame.

df_result = pd.DataFrame( {
                        "Check Date" : ["2019-01-04","2019-01-18"],
                        "Pay Start Date" : ["2018-12-15","2018-12-29"],
                        "Pay End Date" : ["2018-12-28","2019-01-11"],
                        "Hours" : [232,19.5],
                        "Wages" : [2196.53,530.38],
                        "Gross Hours" : [64.0,19.5],
                        "Regular Wages" : [2133.2,530.38],
                        "Overtime Wages" : [63.33,"NaN"]
                  } )
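
For example, the 64.0 "Gross Hours" figure for the 2019-01-04 check date can be recomputed directly from df1 with the GrossHours list (a quick throwaway check; mask is just a scratch variable):

mask = (df1['Check Date'] == '2019-01-04') & df1['Pay Code'].isin(GrossHours)
df1.loc[mask, 'Hours'].sum()   # 46.0 + 18.0 = 64.0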

What am I trying? I've tried applying tons of lambda functions to df1 that give me the results I want, but I'm not certain how to get these resulting objects back into the original DataFrame df1 cleanly. Is the only option to make a bunch of intermediary DataFrames that are then joined or merged back onto the original, which is then groupby'ed again?

g1 = df1.groupby(["Check Date"])

# Per-Check-Date sum of Hours, restricted to the Gross Hours pay codes
g1.apply(lambda x: x[x['Pay Code'].isin(GrossHours)]['Hours'].astype(float).sum())

Check Date
2019-01-04    64.0
2019-01-18    19.5
dtype: float64
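
One way to line up several of these grouped Series without a chain of intermediary merges might be to let pd.concat align them on their shared "Check Date" index (a sketch only; gross_hours, regular_wages, overtime_wages, and partial are just placeholder names):

gross_hours = g1.apply(lambda x: x[x['Pay Code'].isin(GrossHours)]['Hours'].sum())
regular_wages = g1.apply(lambda x: x[x['Pay Code'].isin(RegularWages)]['Wages'].sum())
overtime_wages = g1.apply(lambda x: x[x['Pay Code'].isin(OvertimeWages)]['Wages'].sum())

# The three Series share the "Check Date" index, so concat lines them up without a merge
partial = pd.concat([gross_hours, regular_wages, overtime_wages], axis=1,
                    keys=['Gross Hours', 'Regular Wages', 'Overtime Wages'])
# Note: a check date with no matching pay codes sums to 0.0 here rather than NaN

From there, joining partial onto the plain Hours/Wages totals by "Check Date" would take a single merge rather than several.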

First, I set up a list of tuples to iterate over.

transformations = [('Gross_Hours', ['1','2','3']), ('Regular_Wages', ['1','3','4']), ('Overtime_Wages', ['2','OCH'])]

I also defined the structure of the output DataFrame I expect.

result_dataframe_fields = ['Check Date', 'Pay Start Date','Pay End Date','Gross Hours', 'Regular Wages', 'Overtime Wages']

By applying a suggestion by @Datanovice to an approach similar to the path I was already going down, I ended up with the following, which is about as clean and readable as I can get it.

# Instantiate the result dataframe: grouping by the date fields and the placeholder
# columns keeps them, while .sum() aggregates Hours and Wages
df_result = df1.groupby(result_dataframe_fields).sum().reset_index()

for t_ix, t_list in transformations:
    # Create the aggregated set that will populate the result dataframe:
    # Gross Hours sums the Hours column, the wage fields sum the Wages column
    if t_ix == 'Gross_Hours':
        g1 = df1.loc[df1['Pay Code'].isin(t_list)].groupby('Check Date')['Hours'].agg(temp_col_name='sum')
    else:
        g1 = df1.loc[df1['Pay Code'].isin(t_list)].groupby('Check Date')['Wages'].agg(temp_col_name='sum')
    g2 = g1.reset_index()
    g2.columns = ['Check Date', t_ix]

    # Handle the .agg() column naming limitation (no spaces allowed in named aggregation)
    colsg2 = g2.columns
    colsg2 = colsg2.map(lambda x: x.replace('_', ' ') if isinstance(x, str) else x)
    g2.columns = colsg2

    # Dataframe copy that will update the result dataframe;
    # .update() aligns on the shared RangeIndex and column names
    update_df = g2.copy()

    df_result.update(update_df)

Result image from Jupyter Lab (screenshot omitted).

I still hope this isn't the best answer, as my actual application is far larger than this example, and the approach looks rather hideous blown out to my "real code" scale.
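
One tidier-looking variant (just a sketch, not tested at the real-code scale; df2 and df_result2 are placeholder names) would be to mask the source columns by pay code up front and then aggregate once, which drops the per-column loop and the update() step:

# Mask Hours/Wages by pay code first, then aggregate everything in a single groupby
df2 = df1.copy()
df2['Gross Hours'] = df2['Hours'].where(df2['Pay Code'].isin(GrossHours))
df2['Regular Wages'] = df2['Wages'].where(df2['Pay Code'].isin(RegularWages))
df2['Overtime Wages'] = df2['Wages'].where(df2['Pay Code'].isin(OvertimeWages))

# min_count=1 keeps NaN (instead of 0.0) when a check date has no matching pay codes
df_result2 = (df2.groupby(['Check Date', 'Pay Start Date', 'Pay End Date'])
                 [['Hours', 'Wages', 'Gross Hours', 'Regular Wages', 'Overtime Wages']]
                 .sum(min_count=1)
                 .reset_index())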
