Boxplot outlier labeling in R

I want to draw boxplots in R and label the outliers with names. So far I have found this solution.

The function there provides all the functionality I need, but it scrambles the labels. In the following example, it marks the outlier as "u" instead of "o":

library(plyr)
library(TeachingDemos)
source("http://www.r-statistics.com/wp-content/uploads/2011/01/boxplot-with-outlier-label-r.txt") # Load the function
set.seed(1500)
y <- rnorm(20)
x1 <- sample(letters[1:2], 20, replace = TRUE)
lab_y <- sample(letters, 20)
# plot a boxplot with interactions:
boxplot.with.outlier.label(y~x1, lab_y)

Do you know of any solution? The ggplot2 library is very nice, but as far as I know it provides no such functionality. My alternative is to use the text() function and extract the outlier information from the boxplot object. However, with that approach the labels may overlap.
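For reference, the text() alternative mentioned above could look roughly like the sketch below. It assumes the object returned by boxplot() exposes the outlier values in `out` and their group indices in `group`, and that groups are drawn in the order of `factor(x1)`. The helper names `grp` and `idx` are only illustrative. Nothing here prevents overlapping labels.

```r
set.seed(1500)
y <- rnorm(20)
x1 <- sample(letters[1:2], 20, replace = TRUE)
lab_y <- sample(letters, 20)

# Draw the boxplot and keep the returned statistics
bx <- boxplot(y ~ x1)

# Group index of every observation, in the same order boxplot() uses
grp <- as.integer(factor(x1))

# For each outlier, find the first observation with the same group and value
idx <- vapply(
  seq_along(bx$out),
  function(i) which(grp == bx$group[i] & y == bx$out[i])[1],
  integer(1)
)
out_lab <- lab_y[idx]

# text() rejects zero-length labels, so only call it when outliers exist
if (length(bx$out) > 0) {
  text(bx$group, bx$out, labels = out_lab, pos = 4, cex = 0.7)
}
```

Matching on both group and value avoids picking up a label from the other box when two groups happen to share a value.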

Thanks a lot :-)

I took a look at this with debug(boxplot.with.outlier.label), and it turns out there's a bug in the function.

The error occurs on line 125, where the data.frame DATA is constructed from x, y and label_name.

By that point x and y have been reordered, while lab_y hasn't been. When the supplied value of x (your x1) isn't already in order, you get the kind of jumbling you experienced.
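To illustrate the mechanism (this is a toy reconstruction, not the function's actual source), here is what happens when two vectors are reordered and a third is left alone before all three are combined. The names `lab`, `ord`, `bad` and `good` are made up for this sketch.

```r
y   <- c(10, 20, 30, 40)
x   <- c("b", "a", "b", "a")
lab <- c("w", "x", "y", "z")   # "w" belongs to y = 10

ord <- order(x)                # 2 4 1 3

# Labels left in their original order: rows no longer line up
bad <- data.frame(x = x[ord], y = y[ord], label = lab,
                  stringsAsFactors = FALSE)

# Labels reordered together with x and y: rows stay aligned
good <- data.frame(x = x[ord], y = y[ord], label = lab[ord],
                   stringsAsFactors = FALSE)

bad$label[bad$y == 10]         # "y" (wrong)
good$label[good$y == 10]       # "w" (right)
```

If x is already sorted, `ord` is just 1, 2, 3, ... and both versions agree, which is why pre-ordering the data works around the problem.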

As an immediate fix, you can pre-order the data by the x values like this (or do something more elegant):

df <- data.frame(y, x1, lab_y, stringsAsFactors=FALSE)
df <- df[order(df$x1), ]
# Needed since lab_y is not searched for in data (though it probably should be)
lab_y <- df$lab_y  

boxplot.with.outlier.label(y~x1, lab_y, data=df)

Boxplot produced by the code above

Intelligent point label placement is a separate issue, discussed here or here. There's no single ideal solution, so you just have to pick one of those.

So you would overplot the normal boxplot with labels, as follows:

set.seed(1501)
y <- c(4, 0, 7, -5, rnorm(16))
x1 <- c("a", "a", "b", "b", sample(letters[1:2], 16, replace = TRUE))
lab_y <- sample(letters, 20)

bx <- boxplot(y~x1)

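# Look up the label of each outlier (first match only if values are tied).
# seq_along() is used because seq(bx$out) misbehaves when there is exactly one outlier.
out_lab <- character(length(bx$out))
for (i in seq_along(bx$out)) {
    out_lab[i] <- lab_y[which(y == bx$out[i])[1]]
}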

identify(bx$group, bx$out, labels = out_lab, cex = 0.7)

Then, while identify() is running, you click where you want each label to be positioned, as described here. When finished, press "STOP". Note that an outlier can match more than one label (if several y values are equal); my solution simply picks the first.

PS: I feel ashamed of the for loop, but I don't know how to vectorize it; feel free to post an improvement.
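One possible vectorization of that loop, offered as a sketch: match() returns the position of the first match for each element, which is the same "pick the first" rule the loop uses. The `plot = FALSE` argument is used here only to get the statistics without drawing, and `out_lab_loop` is a name introduced for the comparison.

```r
set.seed(1501)
y <- c(4, 0, 7, -5, rnorm(16))
x1 <- c("a", "a", "b", "b", sample(letters[1:2], 16, replace = TRUE))
lab_y <- sample(letters, 20)

# Compute the boxplot statistics without drawing anything
bx <- boxplot(y ~ x1, plot = FALSE)

# Loop version, kept for comparison
out_lab_loop <- character(length(bx$out))
for (i in seq_along(bx$out)) {
  out_lab_loop[i] <- lab_y[which(y == bx$out[i])[1]]
}

# Vectorized version: position of the first y equal to each outlier value
out_lab <- lab_y[match(bx$out, y)]

identical(out_lab, out_lab_loop)
```

Like the loop, this ignores the group, so a value that occurs in both boxes would get the label of its first occurrence in y.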

EDIT: inspired by Federico's link, I now see it can be done much more easily, with just these two commands:

boxplot(y~x1)
identify(as.integer(as.factor(x1)), y, labels = lab_y, cex = 0.7)
