Remove outlier rows by column and factor in R

I am working with a data frame in R. I have the following function, which removes all rows of a data frame df where, for a specified column index/attribute, the value in that row lies outside the column mean plus or minus n standard deviations of the column.

remove_outliers <- function(df, attr, n) {
  x <- df[, attr]
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  # keep rows whose value lies within mean +/- n standard deviations
  df[x >= m - n * s & x <= m + n * s, ]
}

There are two parts to my question.

(1) My data frame df also has a column 'Group', which specifies a class label. I would like to remove outliers according to the mean and standard deviation within their group within the column, i.e. organised by factor (within the column). So a row labelled with group A would be removed from the data frame if, in the specified column/attribute, the value in that row is outside the mean (of group A rows in that column) plus/minus n*stdev (of group A rows in that column). And the same for groups B, C, D, E, F, etc.
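To make the difference between the two rules concrete, here is a small sketch with made-up numbers. The column name x, the two groups and the threshold n = 1.5 are chosen only for illustration and are not part of the real data.

```r
# Made-up data: group B sits far from group A, and 130 is unusual only within B
df <- data.frame(
  Group = rep(c("A", "B"), each = 5),
  x     = c(1, 2, 3, 4, 5, 100, 101, 102, 103, 130)
)
n <- 1.5

# Pooled rule: compare every value with the mean and sd of the whole column
pooled_outlier <- abs(df$x - mean(df$x)) > n * sd(df$x)

# Within-group rule: compare each value with the mean and sd of its own group
is_b <- df$Group == "B"
group_outlier <- logical(nrow(df))
group_outlier[!is_b] <- abs(df$x[!is_b] - mean(df$x[!is_b])) > n * sd(df$x[!is_b])
group_outlier[is_b]  <- abs(df$x[is_b]  - mean(df$x[is_b]))  > n * sd(df$x[is_b])

which(pooled_outlier)  # no rows are flagged by the pooled rule
which(group_outlier)   # only the row holding 130 is flagged
```

The pooled standard deviation is inflated by the gap between the groups, so the pooled rule misses a value that is clearly unusual within its own group.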

How can I do this? (Preferably using only base R and dplyr.) I have tried to use df %>% group_by(Group) followed by mutate, but I'm not sure what to pass to mutate, given that my function remove_outliers seems to require the whole data frame to be passed into it (so that it can return the whole data frame with rows removed based only on the chosen attribute attr).

I am open to hearing suggestions for changing the function remove_outliers as well, as long as they also return the whole data frame as explained. I'd prefer solutions that avoid loops if possible (unless inevitable and no more efficient method presents itself in base R / dplyr).

(2) Is there a straightforward way I could combine outlier considerations across multiple columns? E.g. remove from the data frame df those rows which are outliers with respect to at least N attributes out of a specified vector of attributes/column indices (length ≥ N). Or a more complex condition, such as: remove from the data frame df those rows which are outliers with respect to Attribute 1 and at least 2 of Attributes 2, 4, 6, 8.

(Ideally the definition of outlier would again be within-group within column, as specified in question 1 above, but a solution working in terms of just within column without considering the groups would also be useful for me.)

OK - part 1 (and trying to avoid loops wherever possible):

Here's some test data:

test_data=data.frame(
    group=c(rep("a",100),rep("b",100)),
    value=rnorm(200)
)

We'll find the groups:

groups=levels(factor(test_data[,1])) # works whether or not the column is already a factor

And we'll calculate the outlier limits (here I'm specifying only 1 sd) - sorry for the loop, but it's only over the groups, not the data:

outlier_sds=1
outlier_limits=sapply(groups,function(g) {
    m=mean(test_data[test_data[,1]==g,2])
    s=sd(test_data[test_data[,1]==g,2])
    return(c(m-outlier_sds*s,m+outlier_sds*s))
})

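So we can define the limits for each row of test_data by indexing the columns of outlier_limits with each row's group label:

test_data_limits=outlier_limits[,as.character(test_data[,1])] # as.character: a factor index would use integer codes, not labels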
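This step relies on the columns of outlier_limits being named after the groups, so that indexing by a vector of group labels repeats the matching column once per data row. A small sketch with made-up limits and values (none of these numbers come from the test data above):

```r
# Hypothetical limits for two groups (rows: lower, upper; columns: groups)
outlier_limits <- rbind(lower = c(a = -1, b = 9),
                        upper = c(a =  1, b = 11))
group <- c("a", "b", "b", "a")
value <- c(0.5, 12, 10, -3)

# Indexing columns by name repeats each group's limits once per observation
per_row <- outlier_limits[, group]
dim(per_row)  # 2 rows (lower, upper) by 4 columns (one per observation)

# Compare each value with the limits of its own group
outliers <- value < per_row["lower", ] | value > per_row["upper", ]
outliers      # values 12 and -3 are flagged

# Caution: a factor used as an index is treated as its integer codes,
# not its labels, so convert it with as.character() first.
```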

And use this to determine the outliers:

outliers=test_data[,2]<test_data_limits[1,] | test_data[,2]>test_data_limits[2,]

(or, combining those last steps):

outliers=test_data[,2]<outlier_limits[1,as.character(test_data[,1])] | test_data[,2]>outlier_limits[2,as.character(test_data[,1])]

Finally:

test_data_without_outliers=test_data[!outliers,]
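For comparison, the same within-group filter can probably be written without the explicit pass over groups by using base R's ave(), which returns each row's group statistic as a vector of the same length as the data. This is only a sketch: the function name remove_outliers_by_group and its argument names are illustrative, and unlike the steps above it drops rows whose value is NA.

```r
# Sketch: remove within-group outliers using ave() from base R.
# `remove_outliers_by_group` is a hypothetical helper, not from the original post.
remove_outliers_by_group <- function(df, attr, group, n) {
  x <- df[[attr]]
  g <- df[[group]]
  # ave() applies FUN within each group and returns one value per row
  m <- ave(x, g, FUN = function(v) mean(v, na.rm = TRUE))
  s <- ave(x, g, FUN = function(v) sd(v, na.rm = TRUE))
  # which() drops NA comparisons, so rows with a missing value are removed too
  keep <- which(abs(x - m) <= n * s)
  df[keep, , drop = FALSE]
}

set.seed(1)
test_data <- data.frame(
  group = rep(c("a", "b"), each = 100),
  value = rnorm(200)
)
cleaned <- remove_outliers_by_group(test_data, "value", "group", n = 1)
nrow(cleaned)  # fewer rows than the original 200
```

It returns the whole data frame with only the offending rows removed, which is what the question asks for. With dplyr, the same condition would typically go inside filter() after group_by(Group), since filter() evaluates mean() and sd() per group on a grouped data frame.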

EDIT: now part 2 (apply part 1 with a loop over all the columns in the data):

Some test data with more than one column of values:

test_data2=data.frame(
    group=c(rep("a",100),rep("b",100)),
    value1=rnorm(200),
    value2=2*rnorm(200),
    value3=3*rnorm(200)
)

Combine all the steps of part 1 into a new function find_outliers that returns a logical vector indicating whether each value is an outlier for its respective column & group:

find_outliers = function(values,n_sds,groups) {
    groups=as.character(groups) # index by label; works for factor or character input
    group_names=unique(groups)
    outlier_limits=sapply(group_names,function(g) {
        m=mean(values[groups==g])
        s=sd(values[groups==g])
        return(c(m-n_sds*s,m+n_sds*s))
    })
    return(values < outlier_limits[1,groups] | values > outlier_limits[2,groups])
}

And then apply this function to each of the data columns:

test_groups=test_data2[,1]
test_data_outliers=apply(test_data2[,-1],2,function(d) find_outliers(values=d,n_sds=1,groups=test_groups))

The rowSums of test_data_outliers indicate how many times each row is considered an 'outlier' in the various columns, with respect to its own group:

rowSums(test_data_outliers)
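To go from these counts to the row filters asked about in part 2, one possible sketch follows. It uses a small hand-made logical matrix in place of test_data_outliers so that the result is easy to check by eye; the names flags, dat, N and cols are illustrative assumptions.

```r
# Hand-made outlier flags: one row per observation, one column per attribute
flags <- cbind(
  value1 = c(TRUE,  TRUE,  FALSE, TRUE,  FALSE),
  value2 = c(TRUE,  FALSE, FALSE, TRUE,  TRUE),
  value3 = c(FALSE, FALSE, FALSE, TRUE,  TRUE),
  value4 = c(FALSE, TRUE,  FALSE, FALSE, TRUE)
)
dat <- data.frame(id = 1:5)

# Rule 1: drop rows that are outliers in at least N of the chosen columns
N <- 2
cols <- c("value1", "value2", "value3")
drop_at_least_n <- rowSums(flags[, cols, drop = FALSE]) >= N

# Rule 2: drop rows that are outliers in value1 AND in at least 2 of value2..value4
drop_complex <- flags[, "value1"] &
  rowSums(flags[, c("value2", "value3", "value4"), drop = FALSE]) >= 2

dat[!drop_at_least_n, , drop = FALSE]  # keeps ids 2 and 3
dat[!drop_complex, , drop = FALSE]     # keeps ids 1, 2, 3 and 5
```

Because the flags are plain logicals, any such condition can be built from rowSums() on a column subset combined with & and |.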
