
Replacing outliers with NaN in a pandas DataFrame after applying .groupby()

I would like to remove outliers from a pandas DataFrame, using the standard deviation of a column after applying a groupby.

Here is my data frame:

            ARI      Flesch    Kincaid             Speaker     Score
0     -2.090000  121.220000  -3.400000                 NaN       NaN   
1      8.276460   64.478573   9.034156      William Dudley  1.670275   
2     19.570911   27.362067  17.253580        Janet Yellen -0.604757   
3     -2.090000  121.220000  -3.400000                 NaN       NaN   
4     -2.090000  121.220000  -3.400000                 NaN       NaN   
5     20.643483   17.069411  18.394178       Lael Brainard  0.215396   
6     -2.090000  121.220000  -3.400000                 NaN       NaN   
7     -2.090000  121.220000  -3.400000                 NaN       NaN   
8     12.624198   52.220468  11.403157    Jerome H. Powell -1.350798   
9     18.466305   35.186261  16.205693     Stanley Fischer  0.522121   
10    -2.090000  121.220000  -3.400000                 NaN       NaN   
11    16.953460   36.246573  15.323457       Lael Brainard -0.217779   
12    -2.090000  121.220000  -3.400000                 NaN       NaN   
13    -2.090000  121.220000  -3.400000                 NaN       NaN   
14    17.066088   32.592551  16.108486     Stanley Fischer  0.642245   
15    -2.090000  121.220000  -3.400000                 NaN       NaN 

I would like to first group the dataframe by 'Speaker' and then remove the 'ARI', 'Flesch', and 'Kincaid' values that are outliers, defined as being more than 3 standard deviations from the mean of the scores for that feature.

Please let me know if this is possible. Thanks!

The only required dependency for this approach is pandas.

Suppose we have replaced the NaN values in the 'Speaker' column with something representative, such as 'CommitteeOrganization':

speaker = dataset['Speaker'].fillna(value='CommitteeOrganization')
dataset['Speaker'] = speaker

So the data now looks like this:

Index  ARI         Flesch      Kincaid    Speaker                Score
0      -2.090000   121.220000  -3.400000  CommitteeOrganization  NaN
1      8.276460    64.478573   9.034156   WilliamDudley          1.670275
2      19.570911   27.362067   17.253580  JanetYellen            -0.604757
3      -2.090000   121.220000  -3.400000  CommitteeOrganization  NaN
4      -2.090000   121.220000  -3.400000  CommitteeOrganization  NaN
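For anyone who wants to reproduce this step, here is a minimal sketch that rebuilds the first five rows from the question and applies the same fillna. The values are copied from the question's table; the way the frame is constructed is my assumption, since the original loading code isn't shown.

```python
import pandas as pd

nan = float("nan")

# First five rows of the question's data; NaN marks rows without a named speaker.
dataset = pd.DataFrame({
    "ARI": [-2.09, 8.276460, 19.570911, -2.09, -2.09],
    "Flesch": [121.22, 64.478573, 27.362067, 121.22, 121.22],
    "Kincaid": [-3.4, 9.034156, 17.253580, -3.4, -3.4],
    "Speaker": [nan, "William Dudley", "Janet Yellen", nan, nan],
    "Score": [nan, 1.670275, -0.604757, nan, nan],
})

# groupby drops rows whose key is NaN by default, so label them explicitly first.
dataset["Speaker"] = dataset["Speaker"].fillna("CommitteeOrganization")

print(dataset["Speaker"].value_counts())
```

Without the fillna, the rows with a missing speaker would simply disappear from the grouped result.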

Group by 'Speaker' and take the mean of each column with the pandas groupby function:

datasetGrouped = dataset.groupby(by='Speaker').mean()

This gives one row per speaker:

Speaker                ARI        Flesch      Kincaid    Score
CommitteeOrganization  -2.090000  121.220000  -3.400000  NaN
JanetYellen            19.570911  27.362067   17.253580  -0.604757
JeromeH.Powell         12.624198  52.220468   11.403157  -1.350798
LaelBrainard           18.798471  26.657992   16.858818  -0.001191
StanleyFischer         17.766196  33.889406   16.157089  0.582183
WilliamDudley          8.276460   64.478573   9.034156   1.670275

Compute the standard deviation of each column:

aristd = datasetGrouped['ARI'].std()
fleschstd = datasetGrouped['Flesch'].std()
kincaidstd = datasetGrouped['Kincaid'].std()
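The three standard deviations can also be computed in a single call. This is only a sketch: `grouped` is a hypothetical stand-in for datasetGrouped, rebuilt from the group means printed above, and `std()` in pandas uses the sample standard deviation (ddof=1) unless told otherwise.

```python
import pandas as pd

# Group means copied from the grouped table above (stand-in for datasetGrouped).
grouped = pd.DataFrame(
    {
        "ARI": [-2.09, 19.570911, 12.624198, 18.798471, 17.766196, 8.276460],
        "Flesch": [121.22, 27.362067, 52.220468, 26.657992, 33.889406, 64.478573],
        "Kincaid": [-3.4, 17.253580, 11.403157, 16.858818, 16.157089, 9.034156],
    },
    index=[
        "CommitteeOrganization", "JanetYellen", "JeromeH.Powell",
        "LaelBrainard", "StanleyFischer", "WilliamDudley",
    ],
)

cols = ["ARI", "Flesch", "Kincaid"]

# One standard deviation per column, returned as a Series indexed by column name.
stds = grouped[cols].std()
thresholds = 3 * stds

# Same comparison as the step-by-step version: absolute value against 3 * std.
flagged = grouped[cols].abs() > thresholds

print(thresholds)
print(flagged)
```

Comparing a DataFrame with a Series aligns on column names, so each column is checked against its own threshold.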

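Replace the values with NaN on the rows that meet the condition. Note that this compares each absolute value with three times the column's standard deviation; it does not subtract the column mean first: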

# Assign a real missing value (float NaN) rather than the string 'NaN',
# so the columns keep their float dtype.
datasetGrouped.loc[abs(datasetGrouped.ARI) > aristd*3, 'ARI'] = float('nan')
datasetGrouped.loc[abs(datasetGrouped.Flesch) > fleschstd*3, 'Flesch'] = float('nan')
datasetGrouped.loc[abs(datasetGrouped.Kincaid) > kincaidstd*3, 'Kincaid'] = float('nan')

The final dataset is:

Speaker                ARI        Flesch     Kincaid    Score
CommitteeOrganization  -2.090000  NaN        -3.400000  NaN
JanetYellen            19.570911  27.362067  17.253580  -0.604757
JeromeH.Powell         12.624198  52.220468  11.403157  -1.350798
LaelBrainard           18.798471  26.657992  16.858818  -0.001191
StanleyFischer         17.766196  33.889406  16.157089  0.582183
WilliamDudley          8.276460   64.478573  9.034156   1.670275
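If the question is read the other way (flag a row when its value is more than 3 standard deviations from the mean of that speaker's own rows), something along these lines may be closer. Treat it as a sketch under assumptions: `mask_group_outliers` is a hypothetical helper of mine, not a pandas function, and the demo data is made up because the sample in the question has too few rows per speaker to show the effect.

```python
import pandas as pd


def mask_group_outliers(df, group_col, value_cols, n_std=3.0):
    """Return a copy of df where per-group outliers in value_cols become NaN."""
    out = df.copy()
    grouped = out.groupby(group_col)[value_cols]

    # transform() broadcasts each group's statistic back onto its own rows.
    mean = grouped.transform("mean")
    std = grouped.transform("std")  # sample std; NaN for single-row groups

    # A comparison against a NaN std is False, so single-row groups are untouched.
    is_outlier = (out[value_cols] - mean).abs() > n_std * std

    # mask() replaces the flagged cells with NaN and keeps everything else.
    out[value_cols] = out[value_cols].mask(is_outlier)
    return out


# Made-up data: speaker "A" has 19 ordinary values and one extreme one.
demo = pd.DataFrame({
    "Speaker": ["A"] * 20 + ["B"] * 2,
    "ARI": [10.0] * 19 + [100.0] + [1.0, 2.0],
})

result = mask_group_outliers(demo, "Speaker", ["ARI"])
print(result[result["ARI"].isna()])
```

One thing to keep in mind: with the sample standard deviation, a value can only be more than 3 standard deviations from its group mean when the group has at least 11 rows, so speakers with only a handful of rows will never be flagged by this rule. Rows with a missing speaker should be labelled first, as in the fillna step, because groupby skips NaN keys by default.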

Full code available on: Github

Note: This could be done with less code than shown here, but the answer goes step by step to make it easier to follow.

Note 2: Because the question was a little ambiguous, if I misunderstood something and this isn't the right answer, don't hesitate to tell me and I'll update the answer if possible.
