Removing outliers of different lengths from different columns of a dataframe using R

I have a large data frame. I want to remove from each column the outliers identified by boxplots. Here is a reproducible example:

Make a dummy data frame with 3 columns and a few outliers

sample<-data.frame(a=c(444,2,3,4,-555), b=c(2,3,4,5,68), c=c(-100,8,9,10,11))
> sample
     a  b    c
1  444  2 -100
2    2  3    8
3    3  4    9
4    4  5   10
5 -555 68   11

Define the outliers for each column

out<-lapply(1:length(sample), function(i) sort(boxplot.stats(sample[[i]])$out))
> out
[[1]]
[1] -555  444

[[2]]
[1] 68

[[3]]
[1] -100

Subset data by omitting the outliers

sample<-lapply(1:length(sample), function(i) 
  subset(sample[[i]], sample[[i]]!=out[[i]]))

Surprisingly, it only works partially, and with a warning:

Warning message:
In sample[[i]] != out[[i]] :
  longer object length is not a multiple of shorter object length

The data after subsetting looks like this:

> sample
[[1]]
[1] 444   2   3   4

[[2]]
[1] 2 3 4 5

[[3]]
[1]  8  9 10 11

For column 1, it removed only -555 and kept 444. It worked nicely for columns 2 and 3. The warning message hints at why this is happening: the column and its outlier vector have different lengths. By removing one outlier from each group, it might be keeping similar lengths ...

My second approach is to set all outliers to NA:

sample<-lapply(1:length(sample), function(i) 
  sample[[i]][sample[[i]]==out[[i]]]<-NA)

This doesn't work either. How can I solve this problem?

Try this:

> lapply(1:length(sample), function(i)
         subset(sample[[i]], !sample[[i]]%in%out[[i]]) )
[[1]]
[1] 2 3 4

[[2]]
[1] 2 3 4 5

[[3]]
[1]  8  9 10 11

Note that sample[[i]] != out[[i]] doesn't do what you want: both sides are vectors, so != compares them element by element (recycling the shorter one) instead of testing membership. What you actually want to know is which elements of sample[[i]] are not in out[[i]], so you should use !sample[[i]] %in% out[[i]].
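The same idea can be written without indexing by position. This is only a sketch: the names df and drop_outliers are made up for illustration (df avoids masking base::sample()), and it assumes every column is numeric.

```r
# Toy data from the question, stored under a hypothetical name `df`
df <- data.frame(a = c(444, 2, 3, 4, -555),
                 b = c(2, 3, 4, 5, 68),
                 c = c(-100, 8, 9, 10, 11))

# Hypothetical helper: keep only the values that boxplot.stats()
# does not flag as outliers
drop_outliers <- function(x) x[!x %in% boxplot.stats(x)$out]

# lapply() over the data frame visits each column and keeps the column names;
# the result is a list because the cleaned columns can differ in length
cleaned <- lapply(df, drop_outliers)
cleaned
```

Because each column may lose a different number of values, the result stays a list rather than a data frame.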

To further clarify, you can try this toy example:

> c(444,2,3,4,-555) == c(-555, 444)
[1] FALSE FALSE FALSE FALSE  TRUE
Warning message:
In c(444, 2, 3, 4, -555) == c(-555, 444) :
  longer object length is not a multiple of shorter object length
> c(444,2,3,4,-555) %in% c(-555, 444)
[1]  TRUE FALSE FALSE FALSE  TRUE

In the == example you get a TRUE at the end because of recycling. Internally, R is actually comparing c(444,2,3,4,-555) == c(-555, 444, -555, 444, -555), and only the last elements are equal.
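The second approach from the question (setting outliers to NA) can be fixed in the same way. The attempt there has two problems: it uses == instead of %in%, and the function passed to lapply returns the value of the assignment (NA) rather than the modified vector. A minimal sketch, assuming the original data frame is still intact; df and na_outliers are names invented for this example:

```r
# Toy data from the question, stored under a hypothetical name `df`
df <- data.frame(a = c(444, 2, 3, 4, -555),
                 b = c(2, 3, 4, 5, 68),
                 c = c(-100, 8, 9, 10, 11))

# Hypothetical helper: replace boxplot outliers with NA and
# return the whole modified vector, not just the assigned value
na_outliers <- function(x) {
  x[x %in% boxplot.stats(x)$out] <- NA
  x
}

# Assigning into df[] replaces every column but keeps the data.frame structure
df[] <- lapply(df, na_outliers)
df
```

Since every column keeps its original length, the result is still a rectangular data frame, and functions such as mean(x, na.rm = TRUE) can skip the NA entries.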
