How to mimic geom_boxplot() with outliers using geom_boxplot(stat = "identity")
I would like to pre-compute by-variable summaries of data (with plyr and passing a quantile function) and then plot with geom_boxplot(stat = "identity"). This works great except it (a) does not plot outliers as points and (b) extends the "whiskers" to the max and min of the data being plotted.
Example:
library(plyr)
library(ggplot2)
set.seed(4)
df <- data.frame(fact = sample(letters[1:2], 12, replace = TRUE),
                 val = c(1:10, 100, 101))
df
# fact val
# 1 b 1
# 2 a 2
# 3 a 3
# 4 a 4
# 5 b 5
# 6 a 6
# 7 b 7
# 8 b 8
# 9 b 9
# 10 a 10
# 11 b 100
# 12 a 101
by.fact.df <- ddply(df, c("fact"), function(x) quantile(x$val))
by.fact.df
# fact 0% 25% 50% 75% 100%
# 1 a 2 3.25 5.0 9.00 101
# 2 b 1 5.50 7.5 8.75 100
# What I can do...with faults (a) and (b) above
ggplot(by.fact.df,
       aes(x = fact, ymin = `0%`, lower = `25%`, middle = `50%`,
           upper = `75%`, ymax = `100%`)) +
  geom_boxplot(stat = "identity")
# What I want...
ggplot(df, aes(x = fact, y = val)) +
  geom_boxplot()
What I can do...with faults (a) and (b) mentioned above:
What I would like to obtain, while still leveraging pre-computation via plyr (or another method):
Initial Thoughts: Perhaps there is some way to pre-compute the true end-points of the whiskers without the outliers? Then, subset the data for outliers and pass them to geom_point()?
Motivation: When working with larger datasets, I have found it faster and more practical to leverage plyr, dplyr, and/or data.table to pre-compute the stats and then plot them rather than having ggplot2 do the calculations.
I am able to extract what I need with the following mix of dplyr and plyr code, but I'm not sure if this is the most efficient way:
df %>%
group_by(fact) %>%
do(ldply(boxplot.stats(.$val), data.frame))
Source: local data frame [6 x 3]
Groups: fact
fact .id X..i..
1 a stats 2
2 a stats 4
3 a stats 10
4 a stats 13
5 a stats 16
6 a n 9
Here's my answer, using the built-in functions quantile and boxplot.stats.
geom_boxplot does the calculations for the boxplot slightly differently than boxplot.stats. Read ?geom_boxplot and ?boxplot.stats to understand my implementation below.
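To make the difference concrete, here is a small sketch comparing the two hinge calculations on group "a" of the example data. It assumes, as ?geom_boxplot describes, that ggplot2 takes the hinges as the 25th and 75th percentiles from quantile(), while boxplot.stats derives them from fivenum(). The variable names are illustrative only.

```r
# Group "a" from the example data in the question.
x <- c(2, 3, 4, 6, 10, 101)

# Hinges as 25th/75th percentiles from quantile() (default type 7).
q_hinges <- unname(quantile(x, c(0.25, 0.75)))

# Hinges from boxplot.stats(), which are based on fivenum().
bp <- boxplot.stats(x)
f_hinges <- bp$stats[c(2, 4)]

print(q_hinges)  # 3.25 9
print(f_hinges)  # 3 10
print(bp$out)    # 101
```

For this small sample the hinges differ (3.25 and 9 versus 3 and 10), although both methods flag 101 as the only outlier and end the upper whisker at 10.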
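# Function to calculate boxplot stats that approximate ggplot's implementation
# in geom_boxplot: quantile() hinges, whisker ends and outliers from boxplot.stats().
my_boxplot.stats <- function(x) {
  quantiles <- quantile(x, c(0, 0.25, 0.5, 0.75, 1))
  labels <- names(quantiles)
  bp <- boxplot.stats(x)
  # Replace min and max with the whisker ends, which exclude outliers
  quantiles[c(1, 5)] <- bp$stats[c(1, 5)]
  res <- data.frame(rbind(quantiles))
  names(res) <- labels
  # One row per outlier; a single NA row if the group has no outliers
  out <- if (length(bp$out) > 0) bp$out else NA_real_
  res <- res[rep(1, length(out)), , drop = FALSE]
  res$out <- out
  return(res)
}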
Code to calculate the stats and plot them:
library(dplyr)
df %>% group_by(fact) %>% do(my_boxplot.stats(.$val)) %>%
ggplot(aes(x=fact, y=out, ymin = `0%`, lower = `25%`, middle = `50%`,
upper = `75%`, ymax = `100%`)) +
geom_boxplot(stat = "identity") + geom_point()
To get the correct statistics, you have to do some more calculations than just finding the quantiles. The geom_boxplot function with stat = "identity" does not draw the outliers. So you have to calculate the statistics without the outliers and then use geom_point to draw the outliers separately. The following function (basically a simplified version of stat_boxplot) is probably not the most efficient, but it gives the desired result:
box.df <- df %>% group_by(fact) %>% do({
  stats <- as.numeric(quantile(.$val, c(0, 0.25, 0.5, 0.75, 1)))
  iqr <- diff(stats[c(2, 4)])
  coef <- 1.5
  outliers <- .$val < (stats[2] - coef * iqr) | .$val > (stats[4] + coef * iqr)
  if (any(outliers)) {
    stats[c(1, 5)] <- range(c(stats[2:4], .$val[!outliers]), na.rm = TRUE)
  }
  outlier_values <- .$val[outliers]
  if (length(outlier_values) == 0) outlier_values <- NA_real_
  res <- as.list(t(stats))
  names(res) <- c("lower.whisker", "lower.hinge", "median", "upper.hinge", "upper.whisker")
  res$out <- outlier_values
  as.data.frame(res)
})
box.df
## Source: local data frame [2 x 7]
## Groups: fact
##
## fact lower.whisker lower.hinge median upper.hinge upper.whisker out
## 1 a 2 3.25 5.0 9.00 10 101
## 2 b 1 5.50 7.5 8.75 9 100
ggplot(box.df, aes(x = fact, y = out, middle = median,
                   ymin = lower.whisker, ymax = upper.whisker,
                   lower = lower.hinge, upper = upper.hinge)) +
  geom_boxplot(stat = "identity") +
  geom_point()
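One side effect of storing the outliers in the same data frame as the box statistics is that a group with several outliers gets its statistics repeated once per outlier, and a group with none gets an NA point. A possible variation, sketched here in base R, keeps two data frames instead: one row per group for the boxes and one row per outlier for the points. The helper box_stats and the names stats_df and out_df are hypothetical, and the sketch assumes the same quantile() hinges and 1.5 * IQR rule as above.

```r
# Values copied from the example data frame printed in the question.
df <- data.frame(
  fact = c("b", "a", "a", "a", "b", "a", "b", "b", "b", "a", "b", "a"),
  val  = c(1:10, 100, 101)
)

# Hypothetical helper: box statistics and outliers for one numeric vector.
box_stats <- function(x, coef = 1.5) {
  x <- x[!is.na(x)]                      # drop missing values up front
  q <- as.numeric(quantile(x, c(0.25, 0.5, 0.75)))
  iqr <- q[3] - q[1]
  is_out <- x < q[1] - coef * iqr | x > q[3] + coef * iqr
  inside <- c(q, x[!is_out])             # whiskers never fall inside the box
  list(
    stats = data.frame(ymin = min(inside), lower = q[1], middle = q[2],
                       upper = q[3], ymax = max(inside)),
    out = x[is_out]
  )
}

per_group <- lapply(split(df$val, df$fact), box_stats)

# One row per group, for geom_boxplot(stat = "identity").
stats_df <- do.call(rbind, lapply(names(per_group), function(g) {
  data.frame(fact = g, per_group[[g]]$stats)
}))

# One row per outlier (zero rows for a group without outliers), for geom_point().
out_df <- do.call(rbind, lapply(names(per_group), function(g) {
  o <- per_group[[g]]$out
  data.frame(fact = rep(g, length(o)), val = o)
}))

# Build the plot only if ggplot2 is installed; print(p) would draw it.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot() +
    geom_boxplot(data = stats_df,
                 aes(x = fact, ymin = ymin, lower = lower, middle = middle,
                     upper = upper, ymax = ymax),
                 stat = "identity") +
    geom_point(data = out_df, aes(x = fact, y = val))
}
```

With the example data, stats_df has the same whisker and hinge values as box.df above, and out_df holds the two outliers 101 and 100.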