Replacing outliers with NaN in a pandas DataFrame after applying a .groupby() argument
I would like to remove outliers from a pandas dataframe, using the standard deviation of a column variable, after applying a groupby function.
Here is my data frame:
ARI Flesch Kincaid Speaker Score
0 -2.090000 121.220000 -3.400000 NaN NaN
1 8.276460 64.478573 9.034156 William Dudley 1.670275
2 19.570911 27.362067 17.253580 Janet Yellen -0.604757
3 -2.090000 121.220000 -3.400000 NaN NaN
4 -2.090000 121.220000 -3.400000 NaN NaN
5 20.643483 17.069411 18.394178 Lael Brainard 0.215396
6 -2.090000 121.220000 -3.400000 NaN NaN
7 -2.090000 121.220000 -3.400000 NaN NaN
8 12.624198 52.220468 11.403157 Jerome H. Powell -1.350798
9 18.466305 35.186261 16.205693 Stanley Fischer 0.522121
10 -2.090000 121.220000 -3.400000 NaN NaN
11 16.953460 36.246573 15.323457 Lael Brainard -0.217779
12 -2.090000 121.220000 -3.400000 NaN NaN
13 -2.090000 121.220000 -3.400000 NaN NaN
14 17.066088 32.592551 16.108486 Stanley Fischer 0.642245
15 -2.090000 121.220000 -3.400000 NaN NaN
I would like to first group the dataframe by 'Speaker' and then remove the 'ARI', 'Flesch', and 'Kincaid' values that are outliers, defined as being more than 3 standard deviations from the mean of the scores for that specific feature.
Please let me know if this is possible. Thanks!
The only required dependency for this approach is pandas.
Suppose we have replaced the NaN values in the 'Speaker' column with something representative, like 'CommitteeOrganization':
speaker = dataset['Speaker'].fillna(value='CommitteeOrganization')
dataset['Speaker'] = speaker
So our data looks like:
Index ARI Flesch Kincaid Speaker Score
0 -2.090000 121.220000 -3.400000 CommitteeOrganization NaN
1 8.276460 64.478573 9.034156 WilliamDudley 1.670275
2 19.570911 27.362067 17.253580 JanetYellen -0.604757
3 -2.090000 121.220000 -3.400000 CommitteeOrganization NaN
4 -2.090000 121.220000 -3.400000 CommitteeOrganization NaN
Group by 'Speaker' with the pandas groupby function:
datasetGrouped = dataset.groupby(by='Speaker').mean()
So our data looks like:
Speaker ARI Flesch Kincaid Score
CommitteeOrganization -2.090000 121.220000 -3.400000 NaN
JanetYellen 19.570911 27.362067 17.253580 -0.604757
JeromeH.Powell 12.624198 52.220468 11.403157 -1.350798
LaelBrainard 18.798471 26.657992 16.858818 -0.001191
StanleyFischer 17.766196 33.889406 16.157089 0.582183
WilliamDudley 8.276460 64.478573 9.034156 1.670275
Compute the standard deviation for each column:
aristd = datasetGrouped['ARI'].std()
fleschstd = datasetGrouped['Flesch'].std()
kincaidstd = datasetGrouped['Kincaid'].std()
Replace the values with NaN on the rows that meet the condition:
import numpy as np

datasetGrouped.loc[abs(datasetGrouped.ARI) > aristd*3, 'ARI'] = np.nan
datasetGrouped.loc[abs(datasetGrouped.Flesch) > fleschstd*3, 'Flesch'] = np.nan
datasetGrouped.loc[abs(datasetGrouped.Kincaid) > kincaidstd*3, 'Kincaid'] = np.nan

Note that assigning np.nan (rather than the string 'NaN') keeps the columns numeric.
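The three replacement statements above can also be condensed into a single helper. This is just a sketch under the same assumptions as the step above (same grouped frame, same column names; the function name `mask_outliers` is mine, not from the original):

```python
import numpy as np
import pandas as pd

def mask_outliers(grouped: pd.DataFrame,
                  cols=('ARI', 'Flesch', 'Kincaid')) -> pd.DataFrame:
    """Replace values whose magnitude exceeds 3 standard deviations with NaN."""
    out = grouped.copy()
    for col in cols:
        threshold = 3 * out[col].std()  # sample std, as computed step by step above
        out.loc[out[col].abs() > threshold, col] = np.nan
    return out
```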
The final dataset is:
Speaker ARI Flesch Kincaid Score
CommitteeOrganization -2.090000 NaN -3.400000 NaN
JanetYellen 19.570911 27.3621 17.253580 -0.604757
JeromeH.Powell 12.624198 52.2205 11.403157 -1.350798
LaelBrainard 18.798471 26.658 16.858818 -0.001191
StanleyFischer 17.766196 33.8894 16.157089 0.582183
WilliamDudley 8.276460 64.4786 9.034156 1.670275
Full code available on: Github
Note: This could be done in less code than presented, but the answer is done "step by step" for easier understanding.
Note 2: Because the question was a little ambiguous, if I misunderstood something and didn't provide the right answer, don't hesitate to tell me and I'll update the answer if possible.
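Since Note 2 invites alternatives: the question defines outliers relative to each speaker's own mean, row by row, whereas the steps above filter the grouped means. If per-row behavior is wanted, `groupby().transform()` keeps the original frame's shape. A sketch, with illustrative (not original) function and column names:

```python
import numpy as np
import pandas as pd

def nan_outliers_per_group(df: pd.DataFrame, group_col: str, value_cols) -> pd.DataFrame:
    """Within each group, set values more than 3 group standard deviations
    from the group mean to NaN, keeping every row of the original frame."""
    out = df.copy()
    g = out.groupby(group_col)
    for col in value_cols:
        # distance of each row's value from its own group's mean
        deviation = (out[col] - g[col].transform('mean')).abs()
        out.loc[deviation > 3 * g[col].transform('std'), col] = np.nan
    return out
```

Rows whose group key is NaN are skipped by groupby, so they are left untouched, which matches the fillna step above being applied first.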