Plotting a regression line over the conditional distribution of y for each value of x

Question

For each value of x (educ in this case), I want to plot the distribution of y (income) and add the regression line of y ~ x.

df <- structure(list(
       income = c(16L, 18L, 26L, 16L, 34L, 22L, 42L, 
                  42L, 16L, 20L, 66L, 26L, 20L, 30L, 20L, 30L, 32L, 16L, 20L, 58L, 
                  30L, 26L, 20L, 40L, 32L, 22L, 20L, 56L, 32L, 30L, 30L, 48L, 40L, 
                  84L, 50L, 38L, 30L, 76L, 48L, 36L, 40L, 44L, 30L, 60L, 24L, 88L, 
                  46L, 50L, 50L, 22L, 26L, 46L, 22L, 24L, 64L, 62L, 24L, 50L, 32L, 
                  34L, 52L, 24L, 22L, 20L, 30L, 24L, 120L, 22L, 82L, 18L, 26L, 
                  104L, 28L, 32L, 38L, 44L, 22L, 18L, 24L, 56L), 
       educ = c(10L, 7L, 9L, 11L, 14L, 12L, 16L, 16L, 9L, 10L, 16L, 12L, 10L, 15L, 
                10L, 19L, 16L, 11L, 10L, 16L, 12L, 10L, 8L, 12L, 10L, 11L, 10L, 
                14L, 12L, 11L, 14L, 14L, 7L, 18L, 10L, 12L, 12L, 16L, 16L, 11L, 
                11L, 12L, 10L, 15L, 9L, 17L, 16L, 16L, 14L, 11L, 12L, 16L, 9L, 
                 9L, 14L, 16L, 10L, 13L, 10L, 16L, 18L, 12L, 14L, 13L, 14L, 13L, 
                18L, 10L, 16L, 12L, 12L, 14L, 12L, 12L, 14L, 12L, 12L, 10L, 12L, 
                20L), 
       race = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
              1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
              2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
              3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
              3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
              3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("b", "h", "w"), class = "factor"), 
       race2 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
              1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
              2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
              3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
              3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
              3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), z1 = c(1L, 1L, 1L, 1L, 1L, 
              1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 
              0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
              0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
              0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
              0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L
              ), 
       z2 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
              0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
              1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
              0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
              0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
              0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), row.names = c(NA, -80L), class = c("tbl_df", 
         "tbl", "data.frame"))

So far, I have used the ggridges package to plot the distribution of y at each value of x. However, doing so means swapping the roles of the variables in the aesthetics (x becomes y and vice versa). To 'revert' this, I flipped the coordinates, and as a result I get this:

library(ggplot2)
library(ggridges)

ggplot(df, aes(x = income, y = educ, group = educ)) +
  geom_density_ridges(jittered_points = TRUE,
                      position = position_points_jitter(height = 0),
                      point_size = 1.5,
                      point_shape = 1,
                      alpha = 0.3) +
  coord_flip()

The problem is that if I add a regression line to the plot, I get one regression line for each value of educ (because I had to group by educ to apply geom_density_ridges()). Furthermore, the regression line is actually x ~ y instead of y ~ x.

To try to solve this, I took the fitted line for y ~ x and rewrote it in terms of the swapped axes, so that the regression line looks exactly the same as if I had applied geom_smooth() with educ as x and income as y.

# Fit income ~ educ, then rewrite the line as educ = intercept + slope * income
fit <- lm(income ~ educ, data = df)
slope <- 1 / coef(fit)[[2]]
intercept <- -coef(fit)[[1]] / coef(fit)[[2]]

ggplot(df, aes(x = income, y = educ, group = educ)) +
  geom_density_ridges(jittered_points = TRUE,
                      position = position_points_jitter(height = 0),
                      point_size = 1.5,
                      point_shape = 1,
                      alpha = 0.3) +
  stat_function(fun = function(x) intercept + slope * x, color = "red") +
  scale_y_continuous(breaks = seq(0, 20, 5), limits = c(8, 20)) +
  coord_flip()
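
The slope and intercept above come from solving the fitted line income = a + b * educ for educ, which gives educ = -a/b + (1/b) * income. Below is a minimal sketch of that algebra on a small made-up data set; the toy data frame and the names a, b and income_grid are illustrative assumptions, not part of the original data. It only checks that the rewritten line passes through the same points as the original fit.

```r
# Toy data (hypothetical, for illustration only)
toy <- data.frame(educ   = c(8, 10, 12, 14, 16, 18),
                  income = c(18, 22, 30, 34, 47, 52))

# Fit income ~ educ: income = a + b * educ
fit <- lm(income ~ educ, data = toy)
a <- coef(fit)[["(Intercept)"]]
b <- coef(fit)[["educ"]]

# Rewrite the same line with income on the horizontal axis:
# educ = intercept + slope * income
slope     <- 1 / b
intercept <- -a / b

# Points on the rewritten line
income_grid  <- seq(20, 50, by = 10)
educ_on_line <- intercept + slope * income_grid

# Mapping those points back through the original fit recovers the incomes
back <- a + b * educ_on_line
all.equal(back, income_grid)
```

Note that this is still the y ~ x line drawn on swapped axes, not the regression of x on y; the two lines differ unless the correlation is perfect.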

This is the same line I would have obtained with:

ggplot(df, aes(x = educ, y = income)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)

I was wondering if there is a better way to do this. Specifically, is there a way to plot the distribution of y for each value of x using ggplot2 but without ggridges, so that I don't need to flip the coordinates?

Answer 1

It sounds as though you want to represent the 1-d density of income at each (binned) value of educ. I think the ggridges approach is good here. If you want another way of doing it, you could use geom_tile, where the fill or the alpha represents density. This requires building the densities manually first though, which is a bit of a pain. The end result is quite nice, but I'm not convinced it's nicer than ggridges. However, it does have the benefit of not needing to be flipped for the regression:

d <- do.call(c, lapply(split(df$income, round(df$educ)), function(x) {
  if(length(x) > 1) 
    density(x, from = 12, to = 125)$y * length(x) 
  else 
    numeric(512)}))

df_dens <- data.frame(educ = rep(sort(unique(round(df$educ))), each = 512), 
                      income = rep(seq(12, 125, length.out = 512), 
                               length(sort(unique(round(df$educ))))),
                      dens = d)

ggplot(df, aes(x = educ, y = income)) + 
  geom_tile(data = df_dens, aes(alpha = dens), fill = "red") +
  scale_alpha_continuous(range = c(0, 1)) +
  geom_point() +
  geom_smooth(method = "lm", colour = "red4", se = FALSE, linetype = 2)
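
As a rough sketch of what the density step above produces for a single educ value: density() evaluates the kernel density estimate on a grid of n = 512 points by default, and multiplying by the group size weights each curve by its number of observations, so sparsely populated educ values are drawn fainter. The vector x below is made up for illustration and is not taken from the data.

```r
# Hypothetical incomes for one educ value (illustration only)
x <- c(16, 20, 20, 26, 30, 42)

# Same grid limits as in the answer; the default grid has n = 512 points
dens <- density(x, from = 12, to = 125)
length(dens$x)
range(dens$x)

# Scale by group size, as in the answer, so larger groups get darker tiles
weighted <- dens$y * length(x)

# Rough check: the scaled curve integrates to roughly length(x),
# a little less because the tails outside [12, 125] are cut off
step <- diff(dens$x)[1]
sum(weighted) * step
```

Groups with a single observation cannot be given a default-bandwidth density estimate, which is why the answer substitutes numeric(512), a vector of zeros, for them.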

Answer 1: accepted, 2020-07-08 15:00:39
