Removing outliers of different lengths from different columns of a dataframe using R

Question

I have a large dataframe. I want to remove the outliers, as identified by boxplots, from each column of my dataframe. Here is a reproducible example:

Make a dummy dataframe with 3 columns and a few outliers

sample<-data.frame(a=c(444,2,3,4,-555), b=c(2,3,4,5,68), c=c(-100,8,9,10,11))
> sample
     a  b    c
1  444  2 -100
2    2  3    8
3    3  4    9
4    4  5   10
5 -555 68   11

Define the outliers for each column

out<-lapply(1:length(sample), function(i) sort(boxplot.stats(sample[[i]])$out))
> out
[[1]]
[1] -555  444

[[2]]
[1] 68

[[3]]
[1] -100

Subset data by omitting the outliers

sample<-lapply(1:length(sample), function(i) 
  subset(sample[[i]], sample[[i]]!=out[[i]]))

Surprisingly, it only works partially and gives a warning:

Warning message:
In sample[[i]] != out[[i]] :
  longer object length is not a multiple of shorter object length

The data after subsetting looks like this:

> sample
[[1]]
[1] 444   2   3   4

[[2]]
[1] 2 3 4 5

[[3]]
[1]  8  9 10 11

For column 1, it removed only -555 and kept 444, but it worked as expected for columns 2 and 3. The warning message states why this is happening. Perhaps it removes only one outlier from each column to keep the lengths similar ...

My second approach is to set all outliers to NA

sample<-lapply(1:length(sample), function(i) 
  sample[[i]][sample[[i]]==out[[i]]]<-NA)

This doesn't work either. How can I solve this problem?

Answer 1

Try this:

> lapply(1:length(sample), function(i)
         subset(sample[[i]], !sample[[i]]%in%out[[i]]) )
[[1]]
[1] 2 3 4

[[2]]
[1] 2 3 4 5

[[3]]
[1]  8  9 10 11

Note that sample[[i]] != out[[i]] doesn't work because both sample[[i]] and out[[i]] are vectors, so != compares them element by element (recycling the shorter one) rather than testing membership. What you actually want to know is which elements of sample[[i]] are not in out[[i]], so you should use !sample[[i]] %in% out[[i]].
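The same %in% test can also cover the second approach from the question (setting outliers to NA). The following is only a sketch under a few assumptions: the data frame is named dat here (a name chosen for this example so that base::sample is not masked), and the outliers are recomputed per column with boxplot.stats instead of being read from the out list. The attempt in the question fails for two reasons: == recycles the outlier vector, and the anonymous function returns the value of the assignment (NA) rather than the modified column.

```r
# Same dummy data as in the question, under the hypothetical name `dat`
dat <- data.frame(a = c(444, 2, 3, 4, -555),
                  b = c(2, 3, 4, 5, 68),
                  c = c(-100, 8, 9, 10, 11))

# Replace boxplot outliers with NA, one column at a time.
# Assigning into dat[] keeps the result a data frame with the original names.
dat[] <- lapply(dat, function(x) {
  x[x %in% boxplot.stats(x)$out] <- NA  # membership test, no recycling issue
  x                                     # return the modified column itself
})

dat
```

Because every column keeps its original length, the result stays a rectangular data frame; the outliers -555 and 444 in a, 68 in b, and -100 in c become NA.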

To further clarify, you can try this toy example:

> c(444,2,3,4,-555) == c(-555, 444)
[1] FALSE FALSE FALSE FALSE  TRUE
Warning message:
In c(444, 2, 3, 4, -555) == c(-555, 444) :
  longer object length is not a multiple of shorter object length
> c(444,2,3,4,-555) %in% c(-555, 444)
[1]  TRUE FALSE FALSE FALSE  TRUE

In the == example you get a TRUE at the end because of recycling. Internally, it is actually comparing c(444,2,3,4,-555) == c(-555, 444, -555, 444, -555), and only the last elements match.
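To make the recycling visible, here is a small sketch that builds the recycled vector explicitly with rep_len. The variable names x, out and recycled are chosen just for this illustration.

```r
x   <- c(444, 2, 3, 4, -555)
out <- c(-555, 444)

# What `==` effectively compares against: `out` recycled to the length of x
recycled <- rep_len(out, length(x))
recycled
# [1] -555  444 -555  444 -555

# Element-by-element comparison; no warning, since the lengths now match
x == recycled
# [1] FALSE FALSE FALSE FALSE  TRUE

# Membership test: is each element of x found anywhere in out?
x %in% out
# [1]  TRUE FALSE FALSE FALSE  TRUE
```

The first element of x (444) is compared only with the first element of the recycled vector (-555), which is why == misses it while %in% does not.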

1 answer

solution1
1 ACCEPTED 2014-01-12 02:54:09
