
R: identify outliers and mark them in a boxplot

I have the following fake data representing the answering times (in seconds) of different users in an online questionnaire:

library(dplyr)    # provides %>%
library(ggplot2)

n <- 1000

dat <- data.frame(user = 1:n,
                  # draw one question label per row (size = n rather than relying on recycling)
                  question = sample(paste("q", 1:10, sep = ""), size = n, replace = TRUE),
                  time = round(rnorm(n, mean = 10, sd = 4), 0)
                  )
dat %>%
  ggplot(aes(x = question, y = time)) +
  geom_boxplot(fill = 'orange') +
  ggtitle("Answering time per question")

I then plot the answering times as one boxplot per question. But how can I first calculate a column with a binary variable showing whether a case is an outlier, defined as lying outside median(time) +/- 3 * mad(time) within each question?

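library(dplyr)
library(ggplot2)

dat %>%
  group_by(question) %>%
  # TRUE when a time lies more than 3 MADs from its own question's median
  mutate(outlier = abs(time - median(time)) > 3 * mad(time)) %>%
  ungroup() %>%
  ggplot(aes(x = question, y = time)) +
  geom_boxplot(fill = 'orange') +
  # overlay only the flagged rows in red; the function is applied to the plot data
  geom_point(data = . %>% filter(outlier), color = "red") +
  ggtitle("Answering time per question")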

Because the data are grouped by question first, median(time) and mad(time) are computed separately for each question, so every row is compared against the median and MAD of its own question.
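
To make the per-question thresholds visible, here is a minimal sketch on a small hand-made data set. The `toy` data frame and the column names `med`, `mad_time`, `lower` and `upper` are made up for illustration and are not part of the original post. Note that R's `mad()` multiplies the raw median absolute deviation by 1.4826 by default; pass `constant = 1` if the unscaled value is wanted.

```r
library(dplyr)

# Hypothetical toy data: two questions, with one very slow answer in q1
toy <- data.frame(
  question = rep(c("q1", "q2"), each = 5),
  time     = c(9, 10, 10, 11, 40,
               8,  9, 10, 11, 12)
)

flagged <- toy %>%
  group_by(question) %>%
  mutate(
    med      = median(time),          # per-question median
    mad_time = mad(time),             # per-question MAD (scaled by 1.4826 by default)
    lower    = med - 3 * mad_time,    # lower outlier threshold
    upper    = med + 3 * mad_time,    # upper outlier threshold
    outlier  = time < lower | time > upper
  ) %>%
  ungroup()

# Number of flagged rows per question
flagged %>%
  group_by(question) %>%
  summarise(n_outliers = sum(outlier))
```

The explicit `lower`/`upper` comparison flags the same rows as `abs(time - median(time)) > 3 * mad(time)`; keeping the thresholds as columns just makes them easier to inspect or to draw on the plot.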

