Detecting outlier rows by a specific column in a pandas DataFrame

Question

I have datasets that record voltage measurements in a particular column. I'm looking for an elegant way to extract the rows that deviate from the mean. The "volt_id" column contains several groups, and I'd like each group to have its own mean and standard deviation, which are then used to decide which rows deviate within that group. For example, here is the original dataset:

      time     volt_id     value
 0    14         A         300.00
 1    15         A         310.00
 2    15         B         200.00
 3    16         B         210.00
 4    17         B         300.00
 5    14         C         100.00
 6    16         C         110.00
 7    20         C         200.00

After running the algorithm, I'd keep only rows 4 and 7, which deviate strongly from their groups, as shown below.

      time     volt_id     value
 4    17         B         300.00
 7    20         C         200.00

I could do this for a single group, but my code would become messy and lengthy if I did it for multiple groups. I'd appreciate a simpler way to do this.

Thanks,

Answer 1

You can compute the z-score within each group using groupby and filter on it.

Assuming you want only those rows that are 1 or more standard deviations away from their group mean:

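# Select the 'value' column within each volt_id group
g = df.groupby('volt_id')['value']
# z-score of each row relative to its own group's mean and std
v = (df['value'] - g.transform('mean')) / g.transform('std')

# Keep rows at least 1 standard deviation from their group mean
df[v.abs().ge(1)]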

   time volt_id  value
4    17       B  300.0
7    20       C  200.0
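For readers who want to run this end to end, here is a minimal self-contained sketch of the same idea. The function name `group_zscore_outliers` and its parameters are illustrative choices, not part of pandas; the DataFrame simply reproduces the question's example data.

```python
import pandas as pd

# The example data from the question.
df = pd.DataFrame({
    "time": [14, 15, 15, 16, 17, 14, 16, 20],
    "volt_id": ["A", "A", "B", "B", "B", "C", "C", "C"],
    "value": [300.0, 310.0, 200.0, 210.0, 300.0, 100.0, 110.0, 200.0],
})


def group_zscore_outliers(frame, group_col, value_col, threshold=1.0):
    """Return rows whose value is at least `threshold` sample standard
    deviations away from the mean of its own group.

    Hypothetical helper for illustration only.
    """
    grouped = frame.groupby(group_col)[value_col]
    # transform() broadcasts each group's statistic back onto its rows,
    # so the arithmetic below stays aligned with the original index.
    z = (frame[value_col] - grouped.transform("mean")) / grouped.transform("std")
    # A group with a single row has a NaN std, so its row is never kept.
    return frame[z.abs() >= threshold]


outliers = group_zscore_outliers(df, "volt_id", "value")
print(outliers)
```

On the question's data this keeps rows 4 and 7. Raising `threshold` makes the filter stricter; which cutoff is appropriate depends on the data.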

Answer 2

Similar to @COLDSPEED's solution:

In [179]: from scipy.stats import zscore

In [180]: df.loc[df.groupby('volt_id')['value'].transform(zscore) > 1]
Out[180]:
   time volt_id  value
4    17       B  300.0
7    20       C  200.0
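Two details may matter when choosing between the two answers. `scipy.stats.zscore` defaults to the population standard deviation (`ddof=0`), while pandas' `std` defaults to the sample standard deviation (`ddof=1`), and a `> 1` comparison without `abs()` keeps only values above the group mean. The sketch below uses pandas alone to show both z-score variants side by side on the question's data; the column names `z_sample` and `z_population` are made up for this illustration.

```python
import pandas as pd

# The example data from the question.
df = pd.DataFrame({
    "time": [14, 15, 15, 16, 17, 14, 16, 20],
    "volt_id": ["A", "A", "B", "B", "B", "C", "C", "C"],
    "value": [300.0, 310.0, 200.0, 210.0, 300.0, 100.0, 110.0, 200.0],
})

g = df.groupby("volt_id")["value"]
centered = df["value"] - g.transform("mean")

# Sample standard deviation (ddof=1): the pandas default.
z_sample = centered / g.transform(lambda s: s.std(ddof=1))
# Population standard deviation (ddof=0): the scipy.stats.zscore default.
z_population = centered / g.transform(lambda s: s.std(ddof=0))

comparison = df.assign(z_sample=z_sample, z_population=z_population)
print(comparison.round(3))
```

With small groups the two variants differ noticeably. In group A, for instance, the population z-scores land on about ±1, right at the threshold, so whether those rows are kept depends on `>` versus `>=` and on whether `abs()` is applied.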

Answer 3

One way to do this is to use the usual definition of an outlier: http://www.mathwords.com/o/outlier.htm

You would need to define your interquartile range and the first and third quartiles. You could then filter your data with a simple comparison.

Quartiles are not the only way to determine outliers, however. Here's a discussion comparing standard deviation and quartiles for locating outliers: https://stats.stackexchange.com/questions/175999/determine-outliers-using-iqr-or-standard-deviation
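A rough per-group sketch of the quartile approach, assuming the common 1.5 × IQR fences, could look like the following. The helper `group_iqr_outliers`, its parameter `k`, and the data are all hypothetical. The data uses five readings per group because, with only two or three readings per group as in the question, these fences are too wide to flag any row.

```python
import pandas as pd

# Hypothetical data (not from the question): five readings per group.
df = pd.DataFrame({
    "volt_id": ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
    "value": [300.0, 305.0, 310.0, 315.0, 320.0,
              200.0, 205.0, 210.0, 215.0, 300.0,
              100.0, 105.0, 110.0, 115.0, 200.0],
})


def group_iqr_outliers(frame, group_col, value_col, k=1.5):
    """Return rows lying outside [Q1 - k*IQR, Q3 + k*IQR] of their group.

    Hypothetical helper for illustration only; k=1.5 is the conventional
    multiplier, assumed here rather than taken from the answer.
    """
    grouped = frame.groupby(group_col)[value_col]
    # Broadcast each group's quartiles back onto its rows.
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    values = frame[value_col]
    return frame[(values < q1 - k * iqr) | (values > q3 + k * iqr)]


iqr_outliers = group_iqr_outliers(df, "volt_id", "value")
print(iqr_outliers)
```

Here the last reading in groups B and C falls above its group's upper fence, while group A has no flagged rows.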

Answer 1
2 accepted 2018-03-19 22:48:16

Answer 2
1 2018-03-19 23:23:28

Answer 3
0 2018-03-19 22:54:54