
Labeling Outliers of Boxplots in R

I have code that creates a boxplot using ggplot in R, and I want to label my outliers with the year and Battle.

Here is the code that creates my boxplot:

library(ggplot2)

# Axis labels go in labs(); extra arguments to ggplot() do not set them
ggplot(seabattle, aes(x = PortugesOutcome, y = RatioPort2Dutch)) +
  geom_boxplot(outlier.size = 2, outlier.colour = "green") +
  # fun replaces the deprecated fun.y (ggplot2 >= 3.3.0)
  stat_summary(fun = "mean", geom = "point", shape = 23, size = 3, fill = "pink") +
  labs(x = "Outcome", y = "Ratio of Portuguese to Dutch/British ships") +
  ggtitle("Portuguese Sea Battles")

Can anyone help? I know the plot itself is correct; I just want to label the outliers.

The following is a reproducible solution that uses dplyr and the built-in mtcars dataset.

Walking through the code: first, define a function, is_outlier, that returns TRUE for each value that is an outlier and FALSE otherwise. Next, group_by the grouping variable (cyl in this example; PortugesOutcome in yours) and use mutate to add a variable, outlier, that holds the drat value when drat is an outlier (drat corresponds to RatioPort2Dutch in your example) and NA otherwise, so that nothing is drawn for non-outliers. Finally, plot the boxplot and add the text with geom_text, mapping the label aesthetic to the new variable. The text is shifted slightly to the right with hjust so that the values sit next to, rather than on top of, the outlier points.

library(dplyr)
library(ggplot2)

is_outlier <- function(x) {
  return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}

mtcars %>%
  group_by(cyl) %>%
  mutate(outlier = ifelse(is_outlier(drat), drat, as.numeric(NA))) %>%
  ggplot(., aes(x = factor(cyl), y = drat)) +
    geom_boxplot() +
    geom_text(aes(label = outlier), na.rm = TRUE, hjust = -0.3)

[Image: boxplot]
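To get the year and battle name the question asks for, the same idea can carry a text label instead of the numeric value. The following is only a sketch: the seabattle data frame here is made-up toy data, and the column names Year and Battle are assumptions, since the question does not show the real columns. Base R's ave() stands in for group_by/mutate to keep the dependencies small.

```r
library(ggplot2)

is_outlier <- function(x) {
  x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x)
}

# Made-up toy data for illustration only; Year and Battle are assumed column names
seabattle <- data.frame(
  PortugesOutcome = rep(c("Win", "Loss"), each = 6),
  RatioPort2Dutch = c(1, 2, 3, 4, 5, 9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5),
  Year            = 1601:1612,
  Battle          = paste("Battle", LETTERS[1:12])
)

# Flag outliers within each outcome group (ave() returns 0/1, hence as.logical)
flag <- as.logical(ave(seabattle$RatioPort2Dutch,
                       seabattle$PortugesOutcome,
                       FUN = is_outlier))

# Build the label only for outliers; everything else stays NA and is not drawn
seabattle$label <- ifelse(flag, paste(seabattle$Year, seabattle$Battle), NA)

p <- ggplot(seabattle, aes(x = PortugesOutcome, y = RatioPort2Dutch)) +
  geom_boxplot() +
  geom_text(aes(label = label), na.rm = TRUE, hjust = -0.1)
print(p)
```

With real data, only the paste() call needs to change to whatever columns hold the year and the battle name.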

To label the outliers with row names (based on JasonAizkalns's answer):

library(dplyr)
library(ggplot2)
library(tibble)

is_outlier <- function(x) {
  return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}

dat <- mtcars %>%
  tibble::rownames_to_column(var = "outlier") %>%
  group_by(cyl) %>%
  mutate(is_outlier = ifelse(is_outlier(drat), drat, NA_real_))
# Blank out the row-name label for rows that are not outliers
dat$outlier[is.na(dat$is_outlier)] <- NA_character_

ggplot(dat, aes(y = drat, x = factor(cyl))) +
  geom_boxplot() +
  geom_text(aes(label = outlier), na.rm = TRUE, nudge_y = 0.05)

[Image: boxplot with outlier names]

You can do this within ggplot itself, using an appropriate stat_summary call.

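ggplot(mtcars, aes(x = factor(cyl), y = drat, fill = factor(cyl))) + 
  geom_boxplot() +
  stat_summary(
    # after_stat() replaces the deprecated stat(); fun replaces fun.y (ggplot2 >= 3.3.0)
    aes(label = round(after_stat(y), 1)),
    geom = "text", 
    # boxplot.stats()$out holds the outlying values; return NA when there are none
    fun = function(y) { o <- boxplot.stats(y)$out; if (length(o) == 0) NA else o },
    hjust = -1
  )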


Does this work for you?

library(ggplot2)
library(data.table)

#generate some data
set.seed(123)
n=500
dat <- data.table(group=c("A","B"),value=rnorm(n))

By default, ggplot treats a point as an outlier when it lies more than 1.5 * IQR beyond the edges of the box.

#function that takes in vector of data and a coefficient,
#returns boolean vector if a certain point is an outlier or not
check_outlier <- function(v, coef=1.5){
  quantiles <- quantile(v,probs=c(0.25,0.75))
  IQR <- quantiles[2]-quantiles[1]
  res <- v < (quantiles[1]-coef*IQR)|v > (quantiles[2]+coef*IQR)
  return(res)
}

#apply this to our data
dat[,outlier:=check_outlier(value),by=group]
dat[,label:=ifelse(outlier,"label","")]

#plot
ggplot(dat,aes(x=group,y=value))+geom_boxplot()+geom_text(aes(label=label),hjust=-0.3)


Similar to the answer above, but this takes the outliers directly from ggplot2, avoiding any potential conflict between methods:

# calculate boxplot object
g <- ggplot(mtcars, aes(factor(cyl), drat)) + geom_boxplot()

# get list of outliers 
out <- ggplot_build(g)[["data"]][[1]][["outliers"]]

# label list elements with factor levels
names(out) <- levels(factor(mtcars$cyl))

# convert to tidy data
tidyout <- purrr::map_df(out, tibble::as_tibble, .id = "cyl")

# plot boxplots with labels
g + geom_text(data = tidyout, aes(cyl, value, label = value), 
              hjust = -.3)
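The potential conflict between methods can be made concrete. To my understanding, geom_boxplot() places the box edges with quantile(), whereas boxplot.stats() (used in the stat_summary answer) uses Tukey's hinges from fivenum(), and the two can disagree on small samples. A minimal base R sketch, using a made-up six-value vector chosen to show the difference:

```r
# Made-up vector chosen so that the two rules disagree
x <- c(1, 2, 3, 4, 5, 9)

# Rule 1: fences from quantile() and IQR(), as in is_outlier() above
q <- quantile(x, c(0.25, 0.75))
quantile_out <- x[x < q[1] - 1.5 * IQR(x) | x > q[2] + 1.5 * IQR(x)]

# Rule 2: fences from Tukey's hinges, as computed by boxplot.stats()
hinge_out <- boxplot.stats(x)$out

print(quantile_out)  # 9 is flagged: upper fence is 4.75 + 1.5 * 2.5 = 8.5
print(hinge_out)     # nothing flagged: upper fence is 5 + 1.5 * 3 = 9.5
```

On larger samples the two rules usually agree, but reading the outliers back from ggplot_build() sidesteps the question entirely.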


With a small twist on @JasonAizkalns's solution, you can label outliers with their row position in the data frame.

mtcars[,'row'] <- row(mtcars)[,1]
...
mutate(outlier = ifelse(is_outlier(drat), row, as.numeric(NA)))
...

I load the data frame into the RStudio environment so that I can then take a closer look at the data in the outlier rows.
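For that closer look, the outlier rows can also be pulled out directly. This is a sketch only, not the answerer's exact code: it uses base R's ave() in place of group_by/mutate, and the names cars, flag and outlier_rows are hypothetical.

```r
is_outlier <- function(x) {
  x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x)
}

# Work on a copy so the built-in mtcars is left untouched
cars <- mtcars
cars$row <- seq_len(nrow(cars))

# Flag drat outliers within each cyl group (ave() returns 0/1, hence as.logical)
flag <- as.logical(ave(cars$drat, cars$cyl, FUN = is_outlier))

# Keep only the flagged rows, with every original column available for inspection
outlier_rows <- cars[flag, ]
print(outlier_rows[, c("row", "cyl", "drat")])
```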
