
Groupby and perform row-wise calculation using a custom function

Following on from this question: python - Group by and add new row which is calculation of other rows

I have a pandas dataframe as follows:

col_1   col_2   col_3  col_4
a       X        5      1
a       Y        3      2
a       Z        6      4
b       X        7      8
b       Y        4      3
b       Z        6      5

For each value in col_1, I want to apply a function to the col_3 and col_4 values (and many more columns) in the rows where col_2 is X and Z, and create a new row containing the results. So the output would be as below:

col_1   col_2   col_3  col_4 
a       X        5      1
a       Y        3      2
a       Z        6      4
a       NEW      *      *
b       X        7      8
b       Y        4      3
b       Z        6      5
b       NEW      *      *

Where * are the outputs of the function.

The original question (which only required a simple addition) was answered with:

new = df[df.col_2.isin(['X', 'Z'])]\
  .groupby(['col_1'], as_index=False).sum()\
  .assign(col_2='NEW')

df = pd.concat([df, new]).sort_values('col_1')

I'm now looking for a way to use a custom function, such as (X/Y) or ((X+Y)*2), rather than X+Y. How can I modify this code to work with my new requirements?

I'm not sure if this is what you're looking for, but here goes:

def f(x):
    # x is the Series of col_3 (or col_4) values for one group's 'X' and 'Z' rows.
    y = x.values
    return y[0] / y[1]  # X / Z; replace with your own function

And the change to new is:

new = (
    df[df.col_2.isin(['X', 'Z'])]
      .groupby(['col_1'], as_index=False)[['col_3', 'col_4']]
      .agg(f)
      .assign(col_2='NEW')
)

  col_1     col_3  col_4 col_2
0     a  0.833333   0.25   NEW
1     b  1.166667   1.60   NEW

df = pd.concat([df, new]).sort_values('col_1')

df
  col_1 col_2     col_3  col_4
0     a     X  5.000000   1.00
1     a     Y  3.000000   2.00
2     a     Z  6.000000   4.00
0     a   NEW  0.833333   0.25
3     b     X  7.000000   8.00
4     b     Y  4.000000   3.00
5     b     Z  6.000000   5.00
1     b   NEW  1.166667   1.60

I'm taking a leap of faith in f and assuming the rows within each group are sorted by col_2 before they hit the function (so X comes before Z). If that isn't the case, an additional sort_values call is needed first:

df = df.sort_values(['col_1', 'col_2'])

Should do the trick.
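
If you want the other example function from the question, ((X+Y)*2), only the filter and the body of f need to change. A minimal sketch along the same lines (the choice of 'X' and 'Y' here is just illustrative):

def f(x):
    y = x.values  # values for the 'X' and 'Y' rows, in sorted order
    return (y[0] + y[1]) * 2

new = (
    df[df.col_2.isin(['X', 'Y'])]
      .groupby(['col_1'], as_index=False)[['col_3', 'col_4']]
      .agg(f)
      .assign(col_2='NEW')
)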

def foo(df):
    # Expand variables into dictionary.
    d = {v: df.loc[df['col_2'] == v, ['col_3', 'col_4']] for v in df['col_2'].unique()}

    # Example function: (X + Y) * 2
    result = (d['X'].values + d['Y'].values) * 2

    # Convert result to a new dataframe row.
    result = result.tolist()[0]
    df_new = pd.DataFrame(
        {'col_1': [df['col_1'].iat[0]], 
         'col_2': ['NEW'], 
         'col_3': result[0],
         'col_4': result[1]})
    # Concatenate result with original dataframe for group and return.
    return pd.concat([df, df_new])

>>> df.groupby('col_1').apply(lambda x: foo(x)).reset_index(drop=True)
  col_1 col_2  col_3  col_4
0     a     X      5      1
1     a     Y      3      2
2     a     Z      6      4
3     a   NEW     16      6
4     b     X      7      8
5     b     Y      4      3
6     b     Z      6      5
7     b   NEW     22     22
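
Since the question mentions "and many more columns", here is a sketch of a variant of foo that applies the same function to every column after col_1 and col_2 instead of hard-coding col_3 and col_4. It assumes those remaining columns are all numeric, and the name foo_all_cols is purely illustrative:

def foo_all_cols(df):
    # Everything after col_1 and col_2, assumed numeric.
    value_cols = df.columns[2:]
    d = {v: df.loc[df['col_2'] == v, value_cols].iloc[0] for v in df['col_2'].unique()}
    # Example function: (X + Y) * 2, applied to every value column at once.
    new_vals = (d['X'] + d['Y']) * 2
    df_new = pd.DataFrame([{'col_1': df['col_1'].iat[0], 'col_2': 'NEW', **new_vals.to_dict()}])
    return pd.concat([df, df_new])

df.groupby('col_1').apply(foo_all_cols).reset_index(drop=True)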

A newer approach (which should offer a performance benefit) is to use PyArrow and pandas_udf so the operation is vectorized, as described in the Spark 2.4 documentation: PySpark Usage Guide for Pandas with Apache Arrow.
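
A minimal sketch of that grouped-map pandas_udf pattern, assuming a running SparkSession named spark and the same df as above; the schema string, the function name add_new_row, and the X/Z division are illustrative, not part of the original answers:

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

sdf = spark.createDataFrame(df)  # df is the pandas DataFrame from the question

@pandas_udf("col_1 string, col_2 string, col_3 double, col_4 double",
            PandasUDFType.GROUPED_MAP)
def add_new_row(pdf):
    # pdf is the pandas DataFrame for one col_1 group.
    x = pdf.loc[pdf['col_2'] == 'X', ['col_3', 'col_4']].iloc[0]
    z = pdf.loc[pdf['col_2'] == 'Z', ['col_3', 'col_4']].iloc[0]
    new = pd.DataFrame({'col_1': [pdf['col_1'].iat[0]],
                        'col_2': ['NEW'],
                        'col_3': [x['col_3'] / z['col_3']],  # replace with your own function
                        'col_4': [x['col_4'] / z['col_4']]})
    out = pd.concat([pdf, new], ignore_index=True)
    # Cast to match the double columns declared in the schema above.
    return out.astype({'col_3': 'float64', 'col_4': 'float64'})

result = sdf.groupby('col_1').apply(add_new_row)
result.show()

On Spark 3.x the same idea is usually written as groupBy('col_1').applyInPandas(add_new_row, schema=...) instead of the deprecated GROUPED_MAP udf type.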
