Fill all values in a group with the first non-null value in that group
The following is the pandas DataFrame I have:
cluster Value
1 A
1 NaN
1 NaN
1 NaN
1 NaN
2 NaN
2 NaN
2 B
2 NaN
3 NaN
3 NaN
3 C
3 NaN
4 NaN
4 S
4 NaN
5 NaN
5 A
5 NaN
5 NaN
If we look at the data, cluster 1 has the Value 'A' in one row and NaN in all the others. I want to fill 'A' into every row of cluster 1, and do the same for every other cluster: based on the one value present in a cluster, I want to fill the remaining rows of that cluster. The output should look like this:
cluster Value
1 A
1 A
1 A
1 A
1 A
2 B
2 B
2 B
2 B
3 C
3 C
3 C
3 C
4 S
4 S
4 S
5 A
5 A
5 A
5 A
I am new to Python and not sure how to proceed with this. Can anybody help?
groupby + bfill, and ffill
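# Fill within each cluster in both directions, touching only the Value column
df['Value'] = df.groupby('cluster')['Value'].transform(lambda s: s.bfill().ffill())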
df
cluster Value
0 1 A
1 1 A
2 1 A
3 1 A
4 1 A
5 2 B
6 2 B
7 2 B
8 2 B
9 3 C
10 3 C
11 3 C
12 3 C
13 4 S
14 4 S
15 4 S
16 5 A
17 5 A
18 5 A
19 5 A
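A minimal sketch of why the forward fill should be grouped as well as the backward fill. The small frame below is made up for illustration (it is not the asker's data) and includes a cluster with no value at all, which is where an ungrouped ffill would leak a value across clusters:

```python
import numpy as np
import pandas as pd

# Illustrative data: cluster 2 has no non-null Value at all.
df = pd.DataFrame({
    "cluster": [1, 1, 2, 2, 3, 3],
    "Value": ["A", np.nan, np.nan, np.nan, np.nan, "C"],
})

# Grouped bfill followed by an ungrouped ffill:
# cluster 2 inherits "A" from the last row of cluster 1.
leaky = df.groupby("cluster")["Value"].bfill().ffill()

# Grouped in both directions: cluster 2 stays NaN.
grouped = df.groupby("cluster")["Value"].bfill()
grouped = grouped.groupby(df["cluster"]).ffill()

print(pd.DataFrame({"cluster": df["cluster"], "leaky": leaky, "grouped": grouped}))
```

When every cluster is contiguous and has at least one non-null value, as in the question, both variants give the same result.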
Or,
groupby + transform with first
df['Value'] = df.groupby('cluster').Value.transform('first')
df
cluster Value
0 1 A
1 1 A
2 1 A
3 1 A
4 1 A
5 2 B
6 2 B
7 2 B
8 2 B
9 3 C
10 3 C
11 3 C
12 3 C
13 4 S
14 4 S
15 4 S
16 5 A
17 5 A
18 5 A
19 5 A
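This works because the groupby 'first' aggregation returns the first non-null entry of each group rather than the first row. A small check, using made-up data in which each group starts with NaN:

```python
import numpy as np
import pandas as pd

# Illustrative data: neither cluster has its value in the first row.
df = pd.DataFrame({
    "cluster": [3, 3, 3, 3, 4, 4, 4],
    "Value": [np.nan, np.nan, "C", np.nan, np.nan, "S", np.nan],
})

# transform broadcasts each group's first non-null Value back to every row.
df["Value"] = df.groupby("cluster")["Value"].transform("first")
print(df)
```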
Edit
The following seems better:
nan_map = df.dropna().set_index('cluster').to_dict()['Value']
df['Value'] = df['cluster'].map(nan_map)
print(df)
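If a cluster might contain more than one non-null value, it may be safer to choose explicitly which one is used. The sketch below is one possible variant, not part of the original answer: it keeps the first non-null row per cluster and maps from a Series instead of a dict. The name first_values and the sample data are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative data: cluster 2 holds two different non-null values.
df = pd.DataFrame({
    "cluster": [1, 1, 2, 2, 2],
    "Value": ["A", np.nan, np.nan, "B", "X"],
})

# One entry per cluster: its first non-null Value.
first_values = (
    df.dropna(subset=["Value"])
      .drop_duplicates(subset="cluster")  # keeps the first row per cluster
      .set_index("cluster")["Value"]
)

# Overwrite every row with its cluster's chosen value.
df["Value"] = df["cluster"].map(first_values)
print(df)
```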
Original
I can't think of a better way to do this than iterating over all the rows, but one might exist. First I built your DataFrame:
import pandas as pd
import math
# Build your DataFrame
# DataFrame.from_items was removed in pandas 1.0; a plain dict keeps the column order
df = pd.DataFrame({
    'cluster': [1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,5,5,5,5],
    'Value': [float('nan') for _ in range(20)],
})
df['Value'] = df['Value'].astype(object)
df.at[ 0,'Value'] = 'A'
df.at[ 7,'Value'] = 'B'
df.at[11,'Value'] = 'C'
df.at[14,'Value'] = 'S'
df.at[17,'Value'] = 'A'
Now here's an approach that first creates a nan_map dict, then sets the values in Value as specified in the dict.
# Create a dict to map clusters to unique values
nan_map = df.dropna().set_index('cluster').to_dict()['Value']
# nan_map: {1: 'A', 2: 'B', 3: 'C', 4: 'S', 5: 'A'}
# Apply
for i, row in df.iterrows():
    df.at[i,'Value'] = nan_map[row['cluster']]
print(df)
Output:
cluster Value
0 1 A
1 1 A
2 1 A
3 1 A
4 1 A
5 2 B
6 2 B
7 2 B
8 2 B
9 3 C
10 3 C
11 3 C
12 3 C
13 4 S
14 4 S
15 4 S
16 5 A
17 5 A
18 5 A
19 5 A
Note: This sets every value based on its cluster and doesn't check whether the existing value is NaN. You may want to experiment with something like:
# Apply
for i, row in df.iterrows():
    # Only overwrite cells that currently hold a float NaN
    if isinstance(df.at[i,'Value'], float) and math.isnan(df.at[i,'Value']):
        df.at[i,'Value'] = nan_map[row['cluster']]
to see which is more efficient (my guess is the former, without the checks).
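The NaN-only variant can also be written without a Python-level loop. The following is a sketch rather than part of the original answer; the data and the hand-written nan_map are illustrative, and the existing 'Z' is there only to show that non-null values are left alone:

```python
import numpy as np
import pandas as pd

# Illustrative data: cluster 2 already has a value ("Z") that must survive.
df = pd.DataFrame({
    "cluster": [1, 1, 2, 2],
    "Value": ["A", np.nan, np.nan, "Z"],
})

# Hypothetical cluster-to-value mapping, same shape as nan_map above.
nan_map = {1: "A", 2: "B"}

# Look up each row's cluster, then use the result only where Value is missing.
df["Value"] = df["Value"].fillna(df["cluster"].map(nan_map))
print(df)
```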