Following on from this question: python - Group by and add new row which is calculation of other rows
I have a pandas dataframe as follows:
col_1 col_2 col_3 col_4
a X 5 1
a Y 3 2
a Z 6 4
b X 7 8
b Y 4 3
b Z 6 5
And I want to, for each value in col_1, apply a function with the values in col_3 and col_4 (and many more columns) that correspond to X and Z from col_2 and create a new row with these values. So the output would be as below:
col_1 col_2 col_3 col_4
a X 5 1
a Y 3 2
a Z 6 4
a NEW * *
b X 7 8
b Y 4 3
b Z 6 5
b NEW * *
Where * are the outputs of the function.
Original question (which only requires a simple addition) was answered with:
new = df[df.col_2.isin(['X', 'Z'])]\
.groupby(['col_1'], as_index=False).sum()\
.assign(col_2='NEW')
df = pd.concat([df, new]).sort_values('col_1')
I'm now looking for a way to use a custom function, such as (X/Y) or ((X+Y)*2), rather than X+Y. How can I modify this code to work with my new requirements?
I'm not sure if this is what you're looking for, but here goes:
def f(x):
    y = x.values
    return y[0] / y[1]  # replace with your function
And, the change to new is:
new = (
    df[df.col_2.isin(['X', 'Z'])]
      .groupby(['col_1'], as_index=False)[['col_3', 'col_4']]
      .agg(f)
      .assign(col_2='NEW')
)
col_1 col_3 col_4 col_2
0 a 0.833333 0.25 NEW
1 b 1.166667 1.60 NEW
df = pd.concat([df, new]).sort_values('col_1')
df
col_1 col_2 col_3 col_4
0 a X 5.000000 1.00
1 a Y 3.000000 2.00
2 a Z 6.000000 4.00
0 a NEW 0.833333 0.25
3 b X 7.000000 8.00
4 b Y 4.000000 3.00
5 b Z 6.000000 5.00
1 b NEW 1.166667 1.60
I'm taking a leap of faith in f and assuming the rows are sorted by col_2 within each group before they hit the function, so that y[0] is the X value and y[1] is the Z value. If this isn't the case, an additional sort_values call is needed before the groupby:
df = df.sort_values(['col_1', 'col_2'])
That should do the trick.
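If relying on row order feels fragile, one alternative is to look the X and Z rows up by label instead of by position. The following is only a sketch under the assumption that each (col_1, col_2) pair occurs exactly once, as in the sample data; the variable names indexed, x, z and result are made up for illustration, and the division stands in for whatever function you need.

```python
import pandas as pd

df = pd.DataFrame({
    'col_1': ['a', 'a', 'a', 'b', 'b', 'b'],
    'col_2': ['X', 'Y', 'Z', 'X', 'Y', 'Z'],
    'col_3': [5, 3, 6, 7, 4, 6],
    'col_4': [1, 2, 4, 8, 3, 5],
})

# Index by (col_1, col_2) so rows are selected by label, not by position.
indexed = df.set_index(['col_1', 'col_2'])

# Cross-sections: one row per col_1 value, columns col_3 and col_4.
x = indexed.xs('X', level='col_2')
z = indexed.xs('Z', level='col_2')

# Example function X / Z; pandas aligns the two frames on col_1.
new = (x / z).assign(col_2='NEW').reset_index()

# A stable sort keeps each NEW row after the original rows of its group.
result = (
    pd.concat([df, new], ignore_index=True)
      .sort_values('col_1', kind='mergesort')
      .reset_index(drop=True)
)
print(result)
```

Because x and z are aligned on col_1, the result does not depend on how the rows of df happen to be ordered.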
def foo(df):
    # Expand variables into dictionary.
    d = {v: df.loc[df['col_2'] == v, ['col_3', 'col_4']] for v in df['col_2'].unique()}
    # Example function: (X + Y) * 2
    result = (d['X'].values + d['Y'].values) * 2
    # Convert result to a new dataframe row.
    result = result.tolist()[0]
    df_new = pd.DataFrame(
        {'col_1': [df['col_1'].iat[0]],
         'col_2': ['NEW'],
         'col_3': result[0],
         'col_4': result[1]})
    # Concatenate result with original dataframe for group and return.
    return pd.concat([df, df_new])
>>> df.groupby('col_1').apply(lambda x: foo(x)).reset_index(drop=True)
col_1 col_2 col_3 col_4
0 a X 5 1
1 a Y 3 2
2 a Z 6 4
3 a NEW 16 6
4 b X 7 8
5 b Y 4 3
6 b Z 6 5
7 b NEW 22 22
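Since the question asks about X and Z and about swapping in different functions, the same idea can be parameterised. This is a rough sketch rather than a drop-in replacement: add_row, its labels and value_cols arguments, and the lambda are all hypothetical names introduced here, and it assumes each label appears once per group.

```python
import pandas as pd

def add_row(group, func, labels=('X', 'Z'), value_cols=('col_3', 'col_4')):
    """Append a NEW row to one group, computed by func from the labelled rows."""
    # Look rows up by their col_2 label; each lookup yields a Series of values.
    by_label = group.set_index('col_2')[list(value_cols)]
    values = func(*(by_label.loc[label] for label in labels))
    new_row = {'col_1': group['col_1'].iat[0], 'col_2': 'NEW', **values.to_dict()}
    return pd.concat([group, pd.DataFrame([new_row])], ignore_index=True)

df = pd.DataFrame({
    'col_1': ['a', 'a', 'a', 'b', 'b', 'b'],
    'col_2': ['X', 'Y', 'Z', 'X', 'Y', 'Z'],
    'col_3': [5, 3, 6, 7, 4, 6],
    'col_4': [1, 2, 4, 8, 3, 5],
})

# Example function (X + Z) * 2, applied group by group.
pieces = [add_row(g, lambda x, z: (x + z) * 2) for _, g in df.groupby('col_1')]
out = pd.concat(pieces, ignore_index=True)
print(out)
```

Iterating over the groups and concatenating the pieces gives the same shape of result as groupby(...).apply(...), while keeping the function and the labels it reads as explicit arguments.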
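A newer approach (which should offer a performance advantage) is to use PyArrow and pandas_udf to support vectorized operations, as described in the Spark 2.4 documentation: PySpark Usage Guide for Pandas with Apache Arrow.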