Add a column in a pandas dataframe that is the average of another column based on conditions of other columns
Sorry in advance for the long data table. I do not know a more succinct way to construct the dataframe that I have below.
I have a pandas DataFrame:
import pandas as pd
data = {'ID': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
'Cycle': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
'Repetition': ['1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2'],
'Region': ['x', 'x','x','x','x','x','x','x', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'x','x','x','x','x','x','x','x', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y'],
'Intensity': [34, 89, 34, 45, 34, 56, 78, 65, 45, 45, 34, 56, 34, 56, 56, 66, 56, 78, 23, 45, 42, 56, 86, 5, 33, 44, 78, 89, 34, 42, 34, 66]}
data_df = pd.DataFrame(data)
I would like to add a column that calculates the average intensity when Cycle == 1 for each ID (A and B) and each Region (x and y) and leaves NaN values in all other rows.
The resulting dataframe would return:
wanted_data = {'ID': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
'Cycle': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
'Repetition': ['1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2'],
'Region': ['x', 'x','x','x','x','x','x','x', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'x','x','x','x','x','x','x','x', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y'],
'Intensity': [34, 89, 34, 45, 34, 56, 78, 65, 45, 45, 34, 56, 34, 56, 56, 66, 56, 78, 23, 45, 42, 56, 86, 5, 33, 44, 78, 89, 34, 42, 34, 66],
'Mean Cycle1 Intensity': [39.5, '', '', '', 34, '', '', '', '', '', '', '', '', '', '', '', 44.5, '', '', '', 38, '', '', '', '', '', '', '', '', '', '', ''] }
wanted_data_df = pd.DataFrame(wanted_data)
I tried adding a function:
def meanC1(df):
    for i in df['ID'] and j in df['Region']:
        if df['Cycle'] == 1:
            df['Mean Cycle1 Intensity'] = df['Intensity'].mean()
But this returns:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
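For context, here is a minimal sketch of where that error comes from. The Series `s` and the variable names are made up for illustration and are not part of the question's data. Comparing a Series with a scalar gives a whole boolean Series, so Python cannot reduce it to the single True/False that `if` (or `and`) needs:

```python
import pandas as pd

# Hypothetical stand-in for a 'Cycle' column.
s = pd.Series([1, 2, 1], name="Cycle")

# Comparing a Series with a scalar yields a boolean Series, not one bool.
is_first = s == 1

# Using that Series where a single truth value is expected raises ValueError.
error_message = None
try:
    if is_first:
        pass
except ValueError as exc:
    error_message = str(exc)

# Reducing explicitly, or using the Series as a selection mask, is unambiguous.
any_first = bool(is_first.any())
selected = s[is_first].tolist()
```

This is why row-wise conditions in pandas are usually written as boolean masks rather than `if` statements.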
Use Series.ne to create a boolean mask m, then use Series.mask to mask the Intensity column on m, next use Series.groupby to group the masked column on ID and Repetition and transform using mean, finally again use Series.mask to mask the transformed result:
# Note: Here df refers to `data_df`
m = df['Cycle'].ne(1)
df['Mean Cycle1 Intensity'] = (
    df['Intensity'].mask(m)
    .groupby([df['ID'], df['Repetition']]).transform('mean').mask(m)
)
Result:
    ID  Cycle  Repetition  Region  Intensity  Mean Cycle1 Intensity
0    A      1           1       x         34                   39.5
1    A      2           1       x         89                    NaN
2    A      3           1       x         34                    NaN
3    A      4           1       x         45                    NaN
4    B      1           1       x         34                   34.0
5    B      2           1       x         56                    NaN
6    B      3           1       x         78                    NaN
7    B      4           1       x         65                    NaN
8    A      1           1       y         45                   39.5
9    A      2           1       y         45                    NaN
10   A      3           1       y         34                    NaN
11   A      4           1       y         56                    NaN
12   B      1           1       y         34                   34.0
13   B      2           1       y         56                    NaN
14   B      3           1       y         56                    NaN
15   B      4           1       y         66                    NaN
16   A      1           2       x         56                   44.5
17   A      2           2       x         78                    NaN
18   A      3           2       x         23                    NaN
19   A      4           2       x         45                    NaN
20   B      1           2       x         42                   38.0
21   B      2           2       x         56                    NaN
22   B      3           2       x         86                    NaN
23   B      4           2       x          5                    NaN
24   A      1           2       y         33                   44.5
25   A      2           2       y         44                    NaN
26   A      3           2       y         78                    NaN
27   A      4           2       y         89                    NaN
28   B      1           2       y         34                   38.0
29   B      2           2       y         42                    NaN
30   B      3           2       y         34                    NaN
31   B      4           2       y         66                    NaN
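A possible alternative, sketched here on a small made-up frame (the values below are illustrative and not the question's data): filter the Cycle == 1 rows first, compute the group mean on that subset, and assign it back. This relies on pandas aligning the assigned Series on the index, so rows missing from the subset end up as NaN.

```python
import pandas as pd

# Small illustrative frame (hypothetical values).
df = pd.DataFrame({
    "ID": ["A", "A", "A", "A", "B", "B"],
    "Cycle": [1, 2, 1, 2, 1, 2],
    "Repetition": ["1", "1", "1", "1", "1", "1"],
    "Intensity": [34, 89, 45, 45, 34, 56],
})

# Keep only the Cycle == 1 rows; their original index labels are preserved.
first = df[df["Cycle"].eq(1)]

# Mean intensity per (ID, Repetition), broadcast back onto the subset's rows.
means = first.groupby(["ID", "Repetition"])["Intensity"].transform("mean")

# Assignment aligns on the index, so rows outside the subset become NaN.
df["Mean Cycle1 Intensity"] = means
```

Under these assumptions it should give the same kind of column as the double-mask version above; which one reads better is mostly a matter of taste.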