
Remove outlier values from a data.frame using R

I have a data.frame whose columns hold water quality values.

I would like to remove the outliers from each column and put NA in their place.

EDIT:

I would like to treat the following values as outliers and remove them:

values above the 95% quantile of their column

and

values below the 5% quantile of their column


How could I do that?

Here is an example of my data:

df=read.table(text="st  PH  OD  COD N
A   7.3 1.26301094  1.112359589 0.295842925
B   12.69875867 5.670646078 4.841748321 0.096958426
C   9.613564343 1.706277385 7.952266541 0.102672152
D   9.693461149 7.075560183 0.283503075 0.302494648
A   11.2031501  5.444756127 3.133271063 0.421172108
B   9.288552402 4.169068095 10.54049312 0.122900615
C   4.207333379 6.717653051 10.49073885 0.085634135
D   10.98593946 2.352068972 8.468436777 0.142284793
A   8.20679887  7.826764274 4.464242367 0.211200956
B   12.9165421  0.909886436 1.488358471 0.001640961
C   3.971088246 8.500668307 6.315208679 0.319835127
D   4.821068685 3.871082236 8.669284239 0.349317325
A   0.431563127 0.978922921 10.53756208 0.111929377
B   7.546887828 9.946840115 1.584013576 0.426681716
C   4.689617182 8.717656795 7.474709944 0.473463497
D   9.730568456 1.134763618 4.679810195 0.215744107
A   12.06381259 6.862549062 0.559497593 0.231984105
", 
sep = "", header = TRUE)

Using a combination of apply, quantile, and dplyr::na_if, you can do:

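library(dplyr)  # provides na_if()

# Make sure every measurement column is numeric
df[-1] <- apply(df[-1], 2, as.numeric)
# Replace the value below the 5% quantile with NA (here exactly one value per column)
df[-1] <- apply(df[-1], 2,
                 function(x) na_if(x, x[which(x < quantile(x, probs = 0.05))]))
# Replace the value above the 95% quantile with NA (na.rm skips the NA just added)
df[-1] <- apply(df[-1], 2,
                 function(x) na_if(x, x[which(x > quantile(x, probs = 0.95, na.rm = TRUE))]))
df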

   st        PH        OD        COD          N
1   A  7.300000 1.2630109  1.1123596 0.29584292
2   B 12.698759 5.6706461  4.8417483 0.09695843
3   C  9.613564 1.7062774  7.9522665 0.10267215
4   D  9.693461 7.0755602         NA 0.30249465
5   A 11.203150 5.4447561  3.1332711 0.42117211
6   B  9.288552 4.1690681         NA 0.12290062
7   C  4.207333 6.7176531 10.4907388 0.08563414
8   D 10.985939 2.3520690  8.4684368 0.14228479
9   A  8.206799 7.8267643  4.4642424 0.21120096
10  B        NA        NA  1.4883585         NA
11  C  3.971088 8.5006683  6.3152087 0.31983513
12  D  4.821069 3.8710822  8.6692842 0.34931733
13  A        NA 0.9789229 10.5375621 0.11192938
14  B  7.546888        NA  1.5840136 0.42668172
15  C  4.689617 8.7176568  7.4747099         NA
16  D  9.730568 1.1347636  4.6798102 0.21574411
17  A 12.063813 6.8625491  0.5594976 0.23198410
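
The na_if approach above works here because, with 17 rows, exactly one value per column falls below the 5% cutoff and one above the 95% cutoff. As a hedged alternative that does not depend on that, here is a minimal base-R sketch using logical indexing. The helper name `mask_tails` and the toy vector are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper: set every value outside the [lower, upper]
# quantile range to NA, however many values fall in each tail.
mask_tails <- function(x, lower = 0.05, upper = 0.95) {
  cutoffs <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  # Skip existing NAs, then blank out both tails in one step
  x[!is.na(x) & (x < cutoffs[1] | x > cutoffs[2])] <- NA
  x
}

# Toy vector with 41 values, so several values land in each tail
v <- 0:40
masked <- mask_tails(v)
masked
```

Applied to the question's data, the same helper could be used as `df[-1] <- lapply(df[-1], mask_tails)`.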
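
Alternatively, write a small helper function and apply it to each numeric column with lapply:

rm_outlier <- function(x, lq=5/100, uq=95/100) {
  # lower and upper cutoffs for this column
  qnts = quantile(x, probs=c(lq, uq))
  # values outside the cutoffs become NA, everything else is kept
  ifelse(x < qnts[1] | x > qnts[2], NA, x)
}
# st (column 1) is left out, so the result holds only the numeric columns
do.call(cbind.data.frame, lapply(df[, -1], rm_outlier))
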
          PH        OD        COD          N
1   7.300000 1.2630109  1.1123596 0.29584292
2  12.698759 5.6706461  4.8417483 0.09695843
3   9.613564 1.7062774  7.9522665 0.10267215
4   9.693461 7.0755602         NA 0.30249465
5  11.203150 5.4447561  3.1332711 0.42117211
6   9.288552 4.1690681         NA 0.12290062
7   4.207333 6.7176531 10.4907388 0.08563414
8  10.985939 2.3520690  8.4684368 0.14228479
9   8.206799 7.8267643  4.4642424 0.21120096
10        NA        NA  1.4883585         NA
11  3.971088 8.5006683  6.3152087 0.31983513
12  4.821069 3.8710822  8.6692842 0.34931733
13        NA 0.9789229 10.5375621 0.11192938
14  7.546888        NA  1.5840136 0.42668172
15  4.689617 8.7176568  7.4747099         NA
16  9.730568 1.1347636  4.6798102 0.21574411
17 12.063813 6.8625491  0.5594976 0.23198410
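
The result above no longer contains the `st` column, because only `df[, -1]` is passed to `lapply`. If the station labels should be kept, one option is to assign the result back into the numeric columns. The following is a minimal sketch of that idea. The data frame `toy` is a made-up stand-in for the question's data, and `rm_outlier` is repeated so the snippet runs on its own.

```r
# Same helper as in the answer, repeated for self-containment
rm_outlier <- function(x, lq = 5/100, uq = 95/100) {
  qnts <- quantile(x, probs = c(lq, uq))
  ifelse(x < qnts[1] | x > qnts[2], NA, x)
}

# Made-up stand-in data: one label column and two numeric columns
toy <- data.frame(
  st = rep(c("A", "B", "C", "D"), length.out = 20),
  PH = c(0.1, seq(5, 9, length.out = 18), 14),
  OD = c(seq(2, 8, length.out = 18), 0.01, 20)
)

# Assigning into toy[-1] replaces only the numeric columns and keeps `st`
toy[-1] <- lapply(toy[-1], rm_outlier)
toy
```

With the question's data the equivalent call would be `df[-1] <- lapply(df[-1], rm_outlier)`.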
