
Add a regression line to ggscatter plot but ignore grouping

I am using ggscatter in R to plot the Pearson correlation between two variables. However, when I color the points, a separate regression line is computed for each color. What I want is to color the points in the plot according to the column named 'mycolor', while the regression line is computed on the whole data, regardless of color.

Here is the function I use, with and without color:

df <- structure(list(my_x = c(131L, 100L, NA, 125L, 50L, 50L, 16L, 
3L, 27L, 96L, 176L, 121L, 129L, 84L, 67L, 35L, 36L, 18L, 29L, 
29L, 26L, 25L, 24L, 20L, 28L, 22L, 25L, 15L, 0L, 18L, 13L, 17L, 
14L, 23L, 27L, NA, 6L, 1L, 7L, 1L, 20L, 30L, 16L, 22L, 23L, 22L, 
17L, 12L, 14L, 28L, 16L, 20L, 44L, 27L, 16L, 6L, 10L, 9L, 16L, 
2L, 43L, 6L, 2L, 0L, 1L, 1L, 1L, 1L, 2L, 1L, 47L, 22L, 7L, 3L, 
4L, 3L, 1L, 1L, 1L, 4L, 4L, 1L, 25L, 3L, 3L, 3L, 6L, 6L, 4L, 
1L, 2L, 2L, 5L, 8L, 3L, 5L, 1L, 1L, 1L, 2L, 3L, 6L, 6L, 4L, 8L, 
1L, 4L, 1L, 5L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 0L, 0L, 
2L, 0L, 1L, 2L, 3L, 3L, 4L, 4L, 3L, 2L, 3L, 1L, 2L, 1L), my_y = c(134L, 
90L, 130L, 134L, 44L, 48L, 17L, 4L, 19L, 97L, 178L, 39L, 132L, 
90L, 35L, 35L, 36L, 18L, 28L, 14L, 25L, 26L, 24L, 18L, 25L, 22L, 
9L, 15L, 0L, 21L, 6L, 15L, 15L, 21L, 27L, 19L, 7L, 0L, 8L, 2L, 
10L, 30L, 19L, 23L, 12L, 23L, 16L, 6L, 14L, 29L, 15L, 12L, 21L, 
14L, 11L, 7L, 5L, 4L, 16L, 5L, 36L, 5L, 2L, 0L, 1L, 1L, 1L, 1L, 
2L, 1L, 50L, 22L, 7L, 3L, 6L, 3L, 1L, 1L, 1L, 4L, 4L, 1L, 21L, 
3L, 3L, 3L, 6L, 7L, 4L, 1L, 2L, 2L, 1L, 6L, 3L, 2L, 1L, 1L, 2L, 
2L, 3L, 2L, 6L, 7L, 6L, 1L, 4L, 1L, 5L, 2L, 1L, 2L, 2L, 2L, 2L, 
1L, 2L, 2L, 1L, 0L, 0L, 2L, 0L, 1L, 2L, 3L, 2L, 4L, 4L, 3L, 2L, 
3L, 1L, 2L, 1L), mycolor = c("color1", "color1", "color1", 
"color1", "color1", "color1", "color1", "color1", "color1", 
"color1", "color1", "color1", "color1", "color1", "color1", 
"color2", "color2", "color2", "color2", "color2", "color2", 
"color2", "color2", "color2", "color2", "color2", "color2", 
"color2", "color2", "color2", "color2", "color2", "color2", 
"color2", "color2", "color2", "color2", "color7", 
"Turtle", "Turtle", "color2", "color2", "color2", "color2", 
"color2", "color2", "color2", "color2", "color2", "color2", 
"color2", "color2", "color2", "color2", "color2", "color2", 
"color2", "color2", "color2", "color3", "color4", 
"color4", "color4", "color4", "color4", 
"color4", "color4", "color4", "color4", 
"color4", "color4", "color4", "color5", 
"color5", "color5", "color5", "color5", 
"color5", "color5", "color5", "color5", 
"color5", "color5", "color5", "color5", 
"color5", "color5", "color6", "color6", "color6", "color6", 
"color6", "color6", "color6", "color6", "color6", "color6", "color6", "color6", 
"color6", "color6", "color6", "color6", "color6", "color6", "color6", "color6", 
"color6", "color6", "color6", "color6", "color6", "color6", "color6", "color6", 
"color6", "color6", "color6", "color6", "color6", "color6", "color6", "color6", 
"color6", "color6", "color6", "color6", "color6", "color6", "color6", "color6", 
"color6", "color6", "color6", "color6")), class = "data.frame", row.names = c(NA, 
-135L))
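library(ggpubr)  # provides ggscatter()

# With color: one regression line is drawn per level of mycolor
df %>%
  ggscatter(., y="my_y", x="my_x",
            color="mycolor",
            add = "reg.line", conf.int = TRUE, 
            cor.coef = TRUE, cor.method = "pearson")

# Without color: a single regression line over the whole data
df %>%
  ggscatter(., y="my_y", x="my_x",
            add = "reg.line", conf.int = TRUE, 
            cor.coef = TRUE, cor.method = "pearson")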

The two results:

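[Image: the two resulting plots side by side, the colored version on the left and the uncolored version on the right]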

Taking the example above, I basically want the plot on the left, but with its regression lines replaced by the single regression line of the plot on the right.

Is there any way to do this with ggscatter, or should I use ggplot2 geom_point and add the regression line myself?

Thanks for any help!

Maxime

IMHO the easiest approach would be to add the regression line manually using geom_smooth.

Using mtcars as example data:

library(ggpubr)
#> Loading required package: ggplot2

mtcars %>%
  mutate(cyl = factor(cyl)) %>%
  ggscatter(., y="hp", x="mpg",
            color="cyl",
            cor.coef = TRUE, cor.method = "pearson") +
  geom_smooth(method = "lm", color = "black")
#> `geom_smooth()` using formula 'y ~ x'

I do not see much advantage in using ggscatter() instead of ggplot(), so I add here an answer that does not use 'ggpubr'. Pearson's correlation is the correlation associated with OLS (ordinary least squares) regression, and it does not depend on which variable is the explanatory one and which is the response. The R² value from lm() is the same as the square of the r from cor.test(). In contrast, the fitted line does depend on which variable is mapped to the x aesthetic and which to y. Depending on the variables, linear regression may not be a good approach, and major axis regression should be used instead. If the variable mapped to x is measured with no or minimal error, or can be considered the cause of the response, then linear regression with lm() as the method is the correct approach. However, if both variables are subject to random variation, lm() will produce different fitted lines depending on which of the two variables is arbitrarily mapped to x and which to y.
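The claims in the paragraph above can be checked numerically with base R alone. The following is a minimal sketch, assuming mtcars (mpg as x, hp as y) purely as illustration data; the variable names are made up for this example.

```r
# Base R only; mtcars ships with R.
x <- mtcars$mpg
y <- mtcars$hp

# Pearson's r is symmetric: swapping x and y gives the same value.
r_xy <- cor(x, y, method = "pearson")
r_yx <- cor(y, x, method = "pearson")

# Fit the regression in both directions.
fit_y_on_x <- lm(y ~ x)
fit_x_on_y <- lm(x ~ y)

# R^2 reported by lm() versus the square of Pearson's r.
r2_lm <- summary(fit_y_on_x)$r.squared

# Slope of each fitted line, both expressed as "change in y per unit x".
slope_y_on_x <- unname(coef(fit_y_on_x)[2])
slope_x_on_y <- 1 / unname(coef(fit_x_on_y)[2])

print(c(r = r_xy, r_squared = r_xy^2, lm_r_squared = r2_lm))
print(c(y_on_x = slope_y_on_x, x_on_y = slope_x_on_y))
```

The two printed R² values agree, while the two slopes do not: the correlation is indifferent to the choice of x and y, but the OLS line is not.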

In the first example I show the same example as in the answer by @stefan, but using the grammar of graphics to construct the plot. I use statistics from 'ggplot2' and from 'ggpmisc'. What do we gain? 1) We can have the colour mapping only in the plot layer that needs it, geom_point(), without overriding it later. 2) If we wish, we can rewrite the code with a different order of the layers, say, to plot the scatter on top of the regression line. 3) We gain a lot in flexibility, because we can easily mix and match layer functions (geoms and stats from different packages extending 'ggplot2'). Once one understands that we are adding layers to the plot one by one, and that the aesthetic mapping in the call to ggplot() only sets the default for all layers, the intent of the code is clear. The code remains concise.
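The difference between a default mapping set in ggplot() and a mapping set in a single layer can be made concrete by counting how many lines the smoothing layer fits. This is a small sketch using only 'ggplot2' and mtcars; the object names p_global, p_layer, n_lines_global and n_lines_layer are made up for illustration.

```r
library(ggplot2)

# Colour mapped in ggplot(): every layer inherits it, so geom_smooth()
# is grouped by cyl and fits one line per cyl level.
p_global <- ggplot(mtcars, aes(x = mpg, y = hp, colour = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Colour mapped only in geom_point(): geom_smooth() sees no grouping
# and fits a single line through all the points.
p_layer <- ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point(aes(colour = factor(cyl))) +
  geom_smooth(method = "lm", formula = y ~ x, colour = "black")

# layer_data() returns the data computed for a layer; layer 2 is the smooth.
n_lines_global <- length(unique(layer_data(p_global, 2)$group))
n_lines_layer  <- length(unique(layer_data(p_layer, 2)$group))

print(c(global_mapping = n_lines_global, layer_mapping = n_lines_layer))
```

Printing p_layer gives coloured points with one overall regression line, which is the behaviour asked for in the question.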

In the second example I use a different data set and plot MPG in highway and city traffic, as an example of a case where linear regression is unsuitable and some variation of major axis regression is preferable.

These examples make use of features from 'ggpmisc' (>= 0.5.0), and will not work with earlier versions.
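To illustrate why major axis regression suits this situation, here is a rough base R sketch. It assumes the plain major axis definition (the direction of the first eigenvector of the covariance matrix), which is not necessarily how stat_ma_line() computes its fit internally. The helper ma_slope() is hypothetical, and iris petal measurements (both in cm) stand in for two variables that are both subject to random variation.

```r
# Base R only; iris ships with R. Both variables are measured in cm.
x <- iris$Petal.Length
y <- iris$Petal.Width

# Hypothetical helper: slope of the major axis, taken as the direction of
# the first eigenvector of the 2 x 2 covariance matrix.
ma_slope <- function(x, y) {
  v <- eigen(cov(cbind(x, y)))$vectors[, 1]
  v[2] / v[1]
}

# Major axis slope with the roles of x and y swapped, re-expressed
# as "change in y per unit x" so the two are comparable.
ma_y_vs_x <- ma_slope(x, y)
ma_x_vs_y <- 1 / ma_slope(y, x)

# The same comparison for OLS.
ols_y_vs_x <- unname(coef(lm(y ~ x))[2])
ols_x_vs_y <- 1 / unname(coef(lm(x ~ y))[2])

print(c(ma_y_vs_x = ma_y_vs_x, ma_x_vs_y = ma_x_vs_y))
print(c(ols_y_vs_x = ols_y_vs_x, ols_x_vs_y = ols_x_vs_y))
```

The major axis line is the same whichever variable is treated as x, whereas the two OLS lines differ.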

library(ggplot2)
library(ggpmisc)
#> Loading required package: ggpp
#> 
#> Attaching package: 'ggpp'
#> The following object is masked from 'package:ggplot2':
#> 
#>     annotate

# y depends on x
ggplot(mtcars, aes(y=hp, x=mpg)) +
  geom_point(aes(color=factor(cyl))) +
  stat_correlation(use_label(c("R", "P"))) +
  stat_poly_line()


# both x and y depend on some common factors not plotted
ggplot(mpg, aes(y=hwy, x=cty)) +
  geom_point(aes(color=factor(cyl))) +
  stat_correlation(use_label(c("R", "P"))) +
  stat_ma_line()

Created on 2022-08-21 by the reprex package (v2.0.1)

For simplicity, I kept the default theme_gray(), but adding + theme_classic() at the end of the examples above will make the plots look like those in the question. Alternatively, theme_set(theme_classic()) can be used to change the default theme for the current R session.

In both examples, for the correlation annotation I included values matching those in the question. Other labels are also available, including confidence intervals for r, as well as labels for rank correlation. 'ggpmisc' also provides statistics for adding the equations of the fitted models as annotations.
