How to mimic geom_boxplot() with outliers using geom_boxplot(stat = "identity")

I would like to pre-compute by-variable summaries of data (with plyr, passing a quantile function) and then plot them with geom_boxplot(stat = "identity"). This works well, except that it (a) does not plot outliers as points and (b) extends the whiskers to the minimum and maximum of the data being plotted.

Example:

library(plyr)
library(ggplot2)

set.seed(4)
df <- data.frame(fact = sample(letters[1:2], 12, replace = TRUE),
                 val  = c(1:10, 100, 101))
df
#    fact val
# 1     b   1
# 2     a   2
# 3     a   3
# 4     a   4
# 5     b   5
# 6     a   6
# 7     b   7
# 8     b   8
# 9     b   9
# 10    a  10
# 11    b 100
# 12    a 101

by.fact.df <- ddply(df, c("fact"), function(x) quantile(x$val))

by.fact.df
#   fact 0%  25% 50%  75% 100%
# 1    a  2 3.25 5.0 9.00  101
# 2    b  1 5.50 7.5 8.75  100

# What I can do...with faults (a) and (b) above
ggplot(by.fact.df, 
       aes(x = fact, ymin = `0%`, lower = `25%`, middle = `50%`, 
           upper = `75%`,  ymax = `100%`)) +
  geom_boxplot(stat = "identity")

# What I want...
ggplot(df, aes(x = fact, y = val)) +
  geom_boxplot()

What I can do, with faults (a) and (b) mentioned above:

Plot 01

What I would like to obtain, while still leveraging pre-computation via plyr (or another method):

Plot 02

Initial Thoughts: Perhaps there is some way to pre-compute the true end-points of the whiskers without the outliers? Then, subset the data for outliers and pass them to geom_point()?

Motivation: When working with larger datasets, I have found it faster and more practical to leverage plyr, dplyr, and/or data.table to pre-compute the stats and then plot them, rather than having ggplot2 do the calculations.

UPDATE

I am able to extract what I need with the following mix of dplyr and plyr code, but I'm not sure if this is the most efficient way:

library(dplyr)  # provides %>%, group_by() and do(); load it after plyr

df %>%
  group_by(fact) %>%
  do(ldply(boxplot.stats(.$val), data.frame))

Source: local data frame [6 x 3]
Groups: fact

  fact   .id X..i..
1    a stats      2
2    a stats      4
3    a stats     10
4    a stats     13
5    a stats     16
6    a     n      9

Here's my answer, using the built-in functions quantile and boxplot.stats.

geom_boxplot does the calculations for the boxplot slightly differently than boxplot.stats. Read ?geom_boxplot and ?boxplot.stats to understand my implementation below.

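# Function to calculate boxplot stats that approximate ggplot's
# implementation in geom_boxplot.
my_boxplot.stats <- function(x) {
  quantiles <- quantile(x, c(0, 0.25, 0.5, 0.75, 1))
  labels <- names(quantiles)
  bp <- boxplot.stats(x)
  # Replace the min and max with the whisker ends, i.e. the most extreme
  # values that boxplot.stats() does not flag as outliers
  quantiles[c(1, 5)] <- bp$stats[c(1, 5)]
  out <- bp$out
  # Keep one row (with out = NA) when a group has no outliers
  if (length(out) == 0) out <- NA_real_
  res <- data.frame(rbind(quantiles))
  names(res) <- labels
  # One row per outlier; the box statistics are repeated on each row
  res <- res[rep(1, length(out)), , drop = FALSE]
  res$out <- out
  res
}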

Code to calculate the stats and plot them:

library(dplyr)
df %>% group_by(fact) %>% do(my_boxplot.stats(.$val)) %>% 
      ggplot(aes(x=fact, y=out, ymin = `0%`, lower = `25%`, middle = `50%`,
                 upper = `75%`,  ymax = `100%`)) +
      geom_boxplot(stat = "identity") + geom_point()
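
To make the difference mentioned above concrete, here is a small base R sketch (not part of the original answer). It uses the values of group a from the question's printed data and compares the hinges returned by quantile(), which ?geom_boxplot describes as the first and third quartiles, with the hinges returned by boxplot.stats(), which ?boxplot.stats describes as Tukey's hinges. The variable names are made up for illustration.

```r
# Values of group "a" in the question's example data
x <- c(2, 3, 4, 6, 10, 101)

# Hinges as quantile() computes them (default interpolation, type 7)
q_hinges <- unname(quantile(x, c(0.25, 0.75)))

# Hinges as boxplot.stats() computes them (Tukey's hinges)
bs <- boxplot.stats(x)
t_hinges <- bs$stats[c(2, 4)]

print(q_hinges)  # 3.25 9
print(t_hinges)  # 3 10
print(bs$out)    # 101
```

Because the whisker fences are placed 1.5 * IQR away from the hinges, different hinges can also lead to different whisker ends and outliers on other data, even though both methods agree on the outlier here.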

To get the correct statistics, you have to do more calculations than just finding the quantiles. The geom_boxplot function with stat = "identity" does not draw the outliers, so you have to calculate the statistics without the outliers and then use geom_point to draw the outliers separately. The following function (basically a simplified version of stat_boxplot) is probably not the most efficient, but it gives the desired result:

box.df <- df %>% group_by(fact) %>% do({
  stats <- as.numeric(quantile(.$val, c(0, 0.25, 0.5, 0.75, 1)))
  iqr <- diff(stats[c(2, 4)])
  coef <- 1.5
  outliers <- .$val < (stats[2] - coef * iqr) | .$val > (stats[4] + coef * iqr)
  if (any(outliers)) {
    stats[c(1, 5)] <- range(c(stats[2:4], .$val[!outliers]), na.rm=TRUE)
  }
  outlier_values = .$val[outliers]
  if (length(outlier_values) == 0) outlier_values <- NA_real_
  res <- as.list(t(stats))
  names(res) <- c("lower.whisker", "lower.hinge", "median", "upper.hinge", "upper.whisker")
  res$out <- outlier_values
  as.data.frame(res)
})
box.df
## Source: local data frame [2 x 7]
## Groups: fact
## 
##   fact lower.whisker lower.hinge median upper.hinge upper.whisker out
## 1    a             2        3.25    5.0        9.00            10 101
## 2    b             1        5.50    7.5        8.75             9 100

ggplot(box.df, aes(x = fact, y = out, middle = median,
                   ymin = lower.whisker, ymax = upper.whisker,
                   lower = lower.hinge, upper = upper.hinge)) +
  geom_boxplot(stat = "identity") + 
  geom_point()
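
A possible variation, sketched here under assumptions rather than taken from the answers: keep the box statistics in one data frame (one row per group) and the outliers in a second one (one row per outlier), and give each layer its own data. That way a group with several outliers does not repeat its box row, and a group without outliers needs no NA placeholder. The helper whisker_stats and the names box_df and out_df are hypothetical; the computation uses base R only, and the plotting step is skipped if ggplot2 is not installed.

```r
# Example data from the question; 'fact' is typed out so the result
# does not depend on the random number generator.
df <- data.frame(
  fact = c("b", "a", "a", "a", "b", "a", "b", "b", "b", "a", "b", "a"),
  val  = c(1:10, 100, 101)
)

# Hypothetical helper: box statistics for one numeric vector, using
# quantile() hinges and the 1.5 * IQR rule for the whiskers.
whisker_stats <- function(v, coef = 1.5) {
  q <- unname(quantile(v, c(0.25, 0.5, 0.75)))
  iqr <- q[3] - q[1]
  is_out <- v < q[1] - coef * iqr | v > q[3] + coef * iqr
  # Whiskers reach the most extreme non-outlier values, but never
  # fall inside the box
  rng <- range(c(q, v[!is_out]))
  list(stats = c(rng[1], q, rng[2]), out = v[is_out])
}

by_group <- lapply(split(df$val, df$fact), whisker_stats)

# One row per group for the boxes
box_df <- do.call(rbind, lapply(names(by_group), function(g) {
  s <- by_group[[g]]$stats
  data.frame(fact = g, ymin = s[1], lower = s[2], middle = s[3],
             upper = s[4], ymax = s[5])
}))

# One row per outlier for the points (zero rows for groups without any)
out_df <- do.call(rbind, lapply(names(by_group), function(g) {
  o <- by_group[[g]]$out
  data.frame(fact = rep(g, length(o)), val = o)
}))

# Plot only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(box_df, aes(x = fact)) +
    geom_boxplot(aes(ymin = ymin, lower = lower, middle = middle,
                     upper = upper, ymax = ymax), stat = "identity") +
    geom_point(data = out_df, aes(y = val))
  print(p)
}
```

For the example data, box_df holds the same five statistics per group as box.df above, and out_df holds the two outliers, 101 for group a and 100 for group b.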
