Excluding outliers when plotting a Stripchart with ggplot2

I'm trying to create a combination Boxplot/Scatterplot. I'm doing alright with it so far but there's one issue that's really bothering me that I've been unable to figure out. I'm in R and I've installed the ggplot2 package. Here's the code I'm using:

  # xx is a stand-in for my data set, which I imported from Excel with the
  # column labels as the x-axis values
  boxplot(xx, lwd = 1.5, ylab = 'Minutes', xlab = "Epoch")
  stripchart(xx, vertical = TRUE,
             method = "jitter", add = TRUE, pch = 20, col = 'blue')

This gives me a plot that is pretty close to what I want, but the problem is that the outliers are placed on the chart twice. If possible, I'd like to have the stripchart exclude them (highest groups of blue dots) and only use the ones from the boxplot (black outlined circles) so they stand out as different and don't look so sloppy.

I've tried to alter the points in question by putting a lot of different outlier arguments into the stripchart command, unfortunately with no luck. I've tried setting y-limits below their values, tried using outline=false (which completely removes the stripchart), tried changing outlier color, outpch, etc. The command has not worked for any of these attempts. Here's an example of ylim:

  stripchart(xx, vertical = TRUE,
             method = "jitter", add = TRUE, pch = 20, col = 'blue', ylim = true,
             ylim (0,20))

Error in ylim(0, 20) : could not find function "ylim"

And here's an example with outlier color:

  stripchart(xx, vertical = TRUE,
             method = "jitter", add = TRUE, pch = 20, col = 'blue', outcol = "black")

Warning messages:
1: In plot.xy(xy.coords(x, y), type = type, ...) : "outcol" is not a graphical parameter
.......# warning messages continue as such.

Are stripcharts capable of outlier exclusion? Or do I simply not know enough about them yet (and R as a whole, for that matter) to effectively write the code?

If this can be done, how should I proceed? I'm totally fine with solutions that don't directly address the outlier issue in terms of the data as long as the visual effect on the plot is the same.

Thank you for your time and any help you can give!

Edit: Here's some of the data to play around with. The top row is the column labels and the data is beneath. Sorry if this formatting is bad. The 29s and 30s and such in the 9th row of data, 10th overall, are examples of some of the points plotted as outliers in my graphs that I would like to keep in the boxplot but not in the scatterplot/stripchart.

1   5   10  15  30  60
7.233333333 8.166666667 9.666666667 7.75    9   7
7.133333333 9.25    9.333333333 9.75    10  11
0.733333333 0.5 0.833333333 1   1   0
1.766666667 1.166666667 1   0.75    1   0
1.75    2.25    2.333333333 2.25    1   1
6.75    7   7.166666667 7.75    6.5 7
1.516666667 1.75    1.333333333 2   2   2
1.533333333 1.5 2   1.25    1.5 2
27.3    28.33333333 29.33333333 30.25   28.5    29
6.35    6   6.333333333 7   6   6
7.083333333 8.333333333 8.833333333 8.75    8   8
8.533333333 10.08333333 10.5    12  10.5    11
7.65    8.416666667 9   10.75   9   12
6.85    7.333333333 8   7.25    6   8
4.433333333 5   5.5 5   6.5 6
8.616666667 10  11.66666667 12.25   13  12
3.633333333 3.75    3.5 3.25    3   2
0.8 0.75    0.833333333 1   1   0
7.283333333 8.583333333 9.666666667 9.75    12  8
7.483333333 8.75    8.333333333 7.75    6.5 7
3.466666667 2.916666667 3.166666667 2.5 2   0
5.483333333 6.416666667 6.833333333 6.75    7   8

There are a few things going on here. If you wanted to stick with the base plotting functions (boxplot() and stripchart()), you could simply tell stripchart to plot only the points that meet some criterion. One common rule of thumb treats any point 3 or more standard deviations away from the mean as an outlier. Note that boxplot() itself, by default, draws as outliers the points lying more than 1.5 times the interquartile range beyond the box, so the two criteria can disagree. Instead of passing your unmodified data set to stripchart, we subset it (note the [ ] brackets).

stripchart(lapply(xx, function(v) v[v <= mean(v) + sd(v) * 3]), vertical = TRUE, method = 'jitter', add = TRUE, pch = 20, col = 'blue')
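
If the goal is for the stripchart to skip exactly the points that boxplot() draws as outliers, one option is to reuse the boxplot's own rule through boxplot.stats(), whose $out element holds the values beyond the whiskers. The following is only a sketch under stated assumptions: the small xx data frame and the names drop_outliers and xx_inliers are made up for illustration, and your real data would take the place of xx.

```r
# Hypothetical stand-in for xx: each column is one group on the x-axis.
xx <- data.frame(
  `1` = c(1, 2, 3, 4, 5, 6, 7, 30),
  `5` = c(2, 3, 4, 5, 6, 7, 8, 9),
  check.names = FALSE
)

# Drop from a numeric vector the values that boxplot() would draw as outliers
# (boxplot.stats() uses the same default 1.5 * IQR rule as boxplot()).
drop_outliers <- function(v) v[!v %in% boxplot.stats(v)$out]

# stripchart() accepts a list of numeric vectors, one per group.
xx_inliers <- lapply(xx, drop_outliers)

boxplot(xx, lwd = 1.5, ylab = "Minutes", xlab = "Epoch")
stripchart(xx_inliers, vertical = TRUE, method = "jitter",
           add = TRUE, pch = 20, col = "blue")
```

Because the filtering happens on a copy, the boxplot still sees every value and keeps its outlier circles, while the blue jittered points cover only the remaining data.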


Of course, if you really did want to use ggplot2 (and I recommend installing not only that package, but the entire tidyverse with install.packages('tidyverse')), you could produce an arguably nicer plot.


The data formatting and commands needed to produce the ggplot version are quite different from the base graphics version, and beyond the scope of this answer. Reproducible code follows.


df <- structure(list(X1 = c(7.233333333, 7.133333333, 0.733333333, 1.766666667, 1.75, 6.75, 1.516666667, 1.533333333, 27.3, 6.35, 7.083333333, 8.533333333, 7.65, 6.85, 4.433333333, 8.616666667, 3.633333333, 0.8, 7.283333333, 7.483333333, 3.466666667, 5.483333333 ), X5 = c(8.166666667, 9.25, 0.5, 1.166666667, 2.25, 7, 1.75, 1.5, 28.33333333, 6, 8.333333333, 10.08333333, 8.416666667, 7.333333333, 5, 10, 3.75, 0.75, 8.583333333, 8.75, 2.916666667, 6.416666667 ), X10 = c(9.666666667, 9.333333333, 0.833333333, 1, 2.333333333, 7.166666667, 1.333333333, 2, 29.33333333, 6.333333333, 8.833333333, 10.5, 9, 8, 5.5, 11.66666667, 3.5, 0.833333333, 9.666666667, 8.333333333, 3.166666667, 6.833333333), X15 = c(7.75, 9.75, 1, 0.75, 2.25, 7.75, 2, 1.25, 30.25, 7, 8.75, 12, 10.75, 7.25, 5, 12.25, 3.25, 1, 9.75, 7.75, 2.5, 6.75), X30 = c(9, 10, 1, 1, 1, 6.5, 2, 1.5, 28.5, 6, 8, 10.5, 9, 6, 6.5, 13, 3, 1, 12, 6.5, 2, 7), X60 = c(7L, 11L, 0L, 0L, 1L, 7L, 2L, 2L, 29L, 6L, 8L, 11L, 12L, 8L, 6L, 12L, 2L, 0L, 8L, 7L, 0L, 8L)), .Names = c("X1", "X5", "X10", "X15", "X30", "X60"), class = "data.frame", row.names = c(NA, -22L))

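library(tidyverse)  # provides gather(), %>%, mutate(), group_by(), filter(), ggplot()

# Reshape to long format and flag values more than 3 SD above their group mean
df.long <- gather(df, x, value) %>% 
  mutate(x = as.factor(as.numeric(gsub('X', '', x)))) %>% 
  group_by(x) %>% 
  mutate(is.outlier = value > mean(value) + sd(value) * 3)

# Boxplot of all values, with jittered points drawn for the non-outliers only
plot.df <- ggplot(data = df.long, aes(x = x, y = value, group = x)) +
  geom_boxplot() +
  geom_point(data = filter(df.long, !is.outlier), color = '#0000ff88', position = position_jitter(width = 0.1, height = 0))
plot.df  # print the plot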
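
Since geom_boxplot() by default marks as outliers the points more than 1.5 times the interquartile range beyond the hinges, a 3 SD flag may leave some of those points drawn twice. Below is a hedged sketch of a flag that follows the 1.5 * IQR rule instead. The df_long data, the flag_outliers helper and the is_outlier column are hypothetical names chosen for illustration, not part of the code above.

```r
library(ggplot2)

# Hypothetical long-format data: one row per observation, x is the group.
df_long <- data.frame(
  x     = factor(rep(c("1", "5"), each = 8), levels = c("1", "5")),
  value = c(1, 2, 3, 4, 5, 6, 7, 30,
            2, 3, 4, 5, 6, 7, 8, 9)
)

# TRUE for values more than coef * IQR below Q1 or above Q3.
flag_outliers <- function(v, coef = 1.5) {
  q <- quantile(v, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  v < q[1] - coef * iqr | v > q[2] + coef * iqr
}

# ave() applies the helper within each group; it returns 0/1, so convert back.
df_long$is_outlier <- as.logical(
  ave(df_long$value, df_long$x, FUN = flag_outliers)
)

# Boxplot of everything, jittered points only for the non-outliers.
p <- ggplot(df_long, aes(x = x, y = value)) +
  geom_boxplot() +
  geom_point(data = df_long[!df_long$is_outlier, ],
             colour = "blue", alpha = 0.5,
             position = position_jitter(width = 0.1, height = 0))
print(p)
```

Setting height = 0 in position_jitter() keeps the jitter horizontal, so the plotted y values stay equal to the data.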
