Understanding the execution of a DataFrame operation in Python

I am new to Python and I want to understand how execution takes place in a DataFrame. Let's try this with an example from the Titanic: Machine Learning from Disaster dataset on kaggle.com. I wanted to replace each NaN age with the mean() for the respective sex, i.e. a NaN for a man should be replaced by the mean of the men's ages, and a NaN for a woman by the mean of the women's ages. I achieved this with the following line of code:

_data['new_age'] = _data['new_age'].fillna(_data.groupby('Sex')['Age'].transform('mean'))

My question is: while executing this code, how does the line know that a particular row belongs to a male, so that its NaN should be replaced by the male mean(), and that a female row's NaN should be replaced by the female mean()?

It's because of groupby + transform. When you group and apply an aggregation that returns a scalar per group, a normal groupby collapses the result to a single row for each unique grouping key.

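import numpy as np
import pandas as pd

np.random.seed(42)
# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame({'Sex': list('MFMMFFMMFM'),
                   'Age': np.random.choice([1, 10, 11, 13, np.nan], 10)},
                  index=list('ABCDEFGHIJ'))
df.groupby('Sex')['Age'].mean()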

#Sex
#F    10.5                # One F row
#M    11.5                # One M row
#Name: Age, dtype: float64

Using transform instead broadcasts this result back to the original index, so each row receives the mean of the group it belongs to.

df.groupby('Sex')['Age'].transform('mean')

#A    11.5  # Belonged to M
#B    10.5  # Belonged to F
#C    11.5  # Belonged to M
#D    11.5
#E    10.5
#F    10.5
#G    11.5
#H    11.5
#I    10.5
#J    11.5
#Name: Age, dtype: float64
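One way to picture that broadcast is as a lookup: compute the per-group means, then look up each row's Sex in them. The sketch below is only an illustration of the idea, not how pandas implements transform internally. It writes the toy ages out explicitly instead of drawing them at random, and the names group_means, by_lookup and by_transform are made up for this example.

```python
import numpy as np
import pandas as pd

# Same toy data as in the answer, with the ages written out explicitly.
df = pd.DataFrame({'Sex': list('MFMMFFMMFM'),
                   'Age': [13, np.nan, 11, np.nan, np.nan, 10, 11, 11, 11, np.nan]},
                  index=list('ABCDEFGHIJ'))

# One value per group (index is 'F' and 'M').
group_means = df.groupby('Sex')['Age'].mean()

# Look up each row's group mean from its own Sex value.
by_lookup = df['Sex'].map(group_means)

# transform('mean') produces the same per-row values on the same index.
by_transform = df.groupby('Sex')['Age'].transform('mean')

print((by_lookup == by_transform).all())
```

Both Series carry the original A..J index, which is what lets a later step match them up with the rows of df.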

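To make it crystal clear, I'll assign the transformed result back as a column. Because the transformed Series shares the original index, .fillna pairs each NaN with the mean sitting on the same row label, and that is how it gets the correct mean.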

df['Sex_mean'] = df.groupby('Sex')['Age'].transform('mean')

  Sex   Age  Sex_mean
A   M  13.0      11.5
B   F   NaN      10.5  # NaN will be filled with 10.5
C   M  11.0      11.5
D   M   NaN      11.5  # NaN will be filled with 11.5
E   F   NaN      10.5  # NaN will be filled with 10.5
F   F  10.0      10.5
G   M  11.0      11.5
H   M  11.0      11.5
I   F  11.0      10.5
J   M   NaN      11.5  # NaN will be filled with 11.5
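To finish the thought, here is a minimal sketch of the fill itself on the same toy data. It relies on fillna aligning a Series argument by index label rather than by position, which the reversed copy is meant to show. The column name Age_filled and the variables sex_mean and filled_reversed are hypothetical names chosen for this illustration.

```python
import numpy as np
import pandas as pd

# Same toy data as in the answer, with the ages written out explicitly.
df = pd.DataFrame({'Sex': list('MFMMFFMMFM'),
                   'Age': [13, np.nan, 11, np.nan, np.nan, 10, 11, 11, 11, np.nan]},
                  index=list('ABCDEFGHIJ'))

# Per-row group mean, indexed A..J like df itself.
sex_mean = df.groupby('Sex')['Age'].transform('mean')

# Only the NaN rows are filled; each takes the value at its own label.
df['Age_filled'] = df['Age'].fillna(sex_mean)

# Reversing the order of sex_mean changes nothing, because the
# matching is done on labels and not on position.
filled_reversed = df['Age'].fillna(sex_mean.iloc[::-1])

print(df)
print(filled_reversed.tolist() == df['Age_filled'].tolist())
```

Rows that already had an age keep it; only B, D, E and J pick up their group's mean.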
