Replacing outliers with NaN in a pandas DataFrame after applying a .groupby() argument
I would like to remove outliers from a pandas dataframe, using the standard deviation of a column variable, after applying a groupby function.
Here is my data frame:
ARI Flesch Kincaid Speaker Score
0 -2.090000 121.220000 -3.400000 NaN NaN
1 8.276460 64.478573 9.034156 William Dudley 1.670275
2 19.570911 27.362067 17.253580 Janet Yellen -0.604757
3 -2.090000 121.220000 -3.400000 NaN NaN
4 -2.090000 121.220000 -3.400000 NaN NaN
5 20.643483 17.069411 18.394178 Lael Brainard 0.215396
6 -2.090000 121.220000 -3.400000 NaN NaN
7 -2.090000 121.220000 -3.400000 NaN NaN
8 12.624198 52.220468 11.403157 Jerome H. Powell -1.350798
9 18.466305 35.186261 16.205693 Stanley Fischer 0.522121
10 -2.090000 121.220000 -3.400000 NaN NaN
11 16.953460 36.246573 15.323457 Lael Brainard -0.217779
12 -2.090000 121.220000 -3.400000 NaN NaN
13 -2.090000 121.220000 -3.400000 NaN NaN
14 17.066088 32.592551 16.108486 Stanley Fischer 0.642245
15 -2.090000 121.220000 -3.400000 NaN NaN
I would like to first group the dataframe by 'Speaker' and then remove the 'ARI', 'Flesch', and 'Kincaid' values that are outliers, defined as being more than 3 standard deviations from the mean of the scores for that specific feature.
Please let me know if this is possible. Thanks!
The only required dependency for this approach is pandas.
Suppose we have replaced the NaN values in the 'Speaker' column with something representative, like 'CommitteeOrganization':
speaker = dataset['Speaker'].fillna(value='CommitteeOrganization')
dataset['Speaker'] = speaker
So our data looks like:
Index ARI Flesch Kincaid Speaker Score
0 -2.090000 121.220000 -3.400000 CommitteeOrganization NaN
1 8.276460 64.478573 9.034156 WilliamDudley 1.670275
2 19.570911 27.362067 17.253580 JanetYellen -0.604757
3 -2.090000 121.220000 -3.400000 CommitteeOrganization NaN
4 -2.090000 121.220000 -3.400000 CommitteeOrganization NaN
Group by 'Speaker' with the pandas groupby function:
datasetGrouped = dataset.groupby(by='Speaker').mean()
So our data looks like:
Speaker ARI Flesch Kincaid Score
CommitteeOrganization -2.090000 121.220000 -3.400000 NaN
JanetYellen 19.570911 27.362067 17.253580 -0.604757
JeromeH.Powell 12.624198 52.220468 11.403157 -1.350798
LaelBrainard 18.798471 26.657992 16.858818 -0.001191
StanleyFischer 17.766196 33.889406 16.157089 0.582183
WilliamDudley 8.276460 64.478573 9.034156 1.670275
Compute the standard deviation for each column:
aristd = datasetGrouped['ARI'].std()
fleschstd = datasetGrouped['Flesch'].std()
kincaidstd = datasetGrouped['Kincaid'].std()
Replace the values with NaN on the rows that meet the condition:
import numpy as np

datasetGrouped.loc[abs(datasetGrouped.ARI) > aristd*3, 'ARI'] = np.nan
datasetGrouped.loc[abs(datasetGrouped.Flesch) > fleschstd*3, 'Flesch'] = np.nan
datasetGrouped.loc[abs(datasetGrouped.Kincaid) > kincaidstd*3, 'Kincaid'] = np.nan

Note that assigning np.nan (rather than the string 'NaN') keeps the columns numeric.
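The three replacement statements above can also be condensed into a single helper. This is just a sketch under the same assumptions as the step above (same grouped frame, same column names; the function name `mask_outliers` is mine, not from the original):

```python
import numpy as np
import pandas as pd

def mask_outliers(grouped: pd.DataFrame,
                  cols=('ARI', 'Flesch', 'Kincaid')) -> pd.DataFrame:
    """Replace values whose magnitude exceeds 3 standard deviations with NaN."""
    out = grouped.copy()
    for col in cols:
        threshold = 3 * out[col].std()  # sample std, as computed step by step above
        out.loc[out[col].abs() > threshold, col] = np.nan
    return out
```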
The final dataset is:
Speaker ARI Flesch Kincaid Score
CommitteeOrganization -2.090000 NaN -3.400000 NaN
JanetYellen 19.570911 27.3621 17.253580 -0.604757
JeromeH.Powell 12.624198 52.2205 11.403157 -1.350798
LaelBrainard 18.798471 26.658 16.858818 -0.001191
StanleyFischer 17.766196 33.8894 16.157089 0.582183
WilliamDudley 8.276460 64.4786 9.034156 1.670275
Full code available on: Github
Note: This could be done in less code than presented, but the answer is done "step by step" for easier understanding.
Note 2: Because the question was a little ambiguous, if I misunderstood something and didn't provide the right answer, don't hesitate to tell me and I'll update the answer if possible.
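Since Note 2 invites alternatives: the question defines outliers relative to each speaker's own mean, row by row, whereas the steps above filter the grouped means. If per-row behavior is wanted, `groupby().transform()` keeps the original frame's shape. A sketch, with illustrative (not original) function and column names:

```python
import numpy as np
import pandas as pd

def nan_outliers_per_group(df: pd.DataFrame, group_col: str, value_cols) -> pd.DataFrame:
    """Within each group, set values more than 3 group standard deviations
    from the group mean to NaN, keeping every row of the original frame."""
    out = df.copy()
    g = out.groupby(group_col)
    for col in value_cols:
        # distance of each row's value from its own group's mean
        deviation = (out[col] - g[col].transform('mean')).abs()
        out.loc[deviation > 3 * g[col].transform('std'), col] = np.nan
    return out
```

Rows whose group key is NaN are skipped by groupby, so they are left untouched, which matches the fillna step above being applied first.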