Plot a Linear Regression in ggplot2

I have this data frame:

> head(data)
    sx   yd  sl
1   male 35  36350
2   male 22  35350
3   male 23  28200
4 female 27  26775
5   male 30  33696
6   male 21  28516

Here "sx" is sex, "yd" is years since earning the degree, and "sl" is salary. Using ggplot or plot, I can draw the scatter plot easily.

palette(c("pink", "blue"))
plot(data$yd, data$sl, col = factor(data$sx), xlab = "Years Since Earned Highest Degree", ylab = "Salary (dollars)", main = "Salary Increases with Experience", pch = 19)
legend("topleft", legend = unique(data$sx), col = c("blue", "pink"), pch=19)

library(ggplot2)
ggplot(data, aes(x=yd,y=sl)) + 
    geom_point(shape=21, aes(col=sx, bg=sx)) + 
    xlab("Years Since Earned Highest Degree") + 
    ylab("Salary (dollars)") + 
    ggtitle("Salary Increases with Experience") + 
    scale_color_discrete(guide=FALSE) + 
    labs(fill="sex")

However, I have also fitted a linear model to the data:

mod<-lm(sl~sx*poly(yd,2),data)

I am unable to figure out how to add the model's fit to the graphs. Specifically, I want two color-coded lines, one for the male data and one for the female data, superimposed on the scatter plot. I would assume that R has some way to do this so that I don't have to actually write out the model. Either base plot or ggplot answers are good. Thanks.

Edit:

Running the above ggplot with geom_smooth(aes(col=sx), se = FALSE, method = "lm", formula = sl ~ sx * poly(yd, 2)) added:

ggplot(data, aes(x=yd,y=sl)) + 
    geom_point(shape=21, aes(col=sx, bg=sx)) + 
    geom_smooth(aes(col=sx), se = FALSE, method = "lm", formula = sl ~ sx * poly(yd, 2)) + 
    xlab("Years Since Earned Highest Degree") + 
    ylab("Salary (dollars)") + 
    ggtitle("Salary Increases with Experience") + 
    scale_color_discrete(guide=FALSE) + 
    labs(fill="sex")

Returns this error:

Error in model.frame.default(formula = formula, data = data, weights = weight,  : 
  variable lengths differ (found for '(weights)')
Error in if (nrow(layer_data) == 0) return() : argument is of length zero

library(ggplot2)
data <- data.frame(sx = c("male", "male", "male", "female", "male", "male"),
                   yr = c(35, 22, 23, 27, 30, 21),
                   sl = c(36350, 35350, 28200, 26775, 33696, 28516))
ggplot(data, aes(x=yr,y=sl)) + 
  geom_point(shape=21, aes(col=sx, fill=sx)) + 
  # the formula refers to the aesthetics x and y, not to the column names
  geom_smooth(aes(color = sx), se = FALSE, method = "lm", formula = y ~ poly(x, 2)) + 
  xlab("Years Since Earned Highest Degree") + 
  ylab("Salary (dollars)") + 
  ggtitle("Salary Increases with Experience") + 
  # guide = "none" replaces the deprecated guide = FALSE
  scale_color_discrete(guide="none") + labs(fill="sex")

Is this what you want? Note that inside geom_smooth the formula has to be written in terms of the aesthetics x and y rather than the column names, and mapping color to sx makes it fit one curve per sex. You should get individual fits once you have more data for females. Right now sum(data$sx == 'female') is 1, and a polynomial cannot be fitted to a single point.
For example, try:

data = data.frame(sx = c("male", "male", "male", "female", "male", "male", "female", "female", "female"),
                  yr = c(35, 22, 23, 27, 30, 21, 25, 18, 29),
                  sl = c(36350, 35350, 28200, 26775, 33696, 28516, 27402, 31492, 23195))

This should work.
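As a rough check that the per-group smooth really corresponds to the model in the question, here is a small base R sketch. It assumes the extended example data above (with the column named yr) and compares the fitted values of the full interaction model with those of one quadratic lm per sex, which is what geom_smooth does for each color group. The names fitted_grouped and same_fit are just illustrative.

```r
# Extended example data from the answer above (column is yr here, not yd).
data <- data.frame(
  sx = c("male", "male", "male", "female", "male", "male",
         "female", "female", "female"),
  yr = c(35, 22, 23, 27, 30, 21, 25, 18, 29),
  sl = c(36350, 35350, 28200, 26775, 33696, 28516, 27402, 31492, 23195)
)

# Full interaction model, same structure as the one in the question.
mod <- lm(sl ~ sx * poly(yr, 2), data = data)

# One separate quadratic fit per sex, stored back in the original row order.
fitted_grouped <- numeric(nrow(data))
for (s in unique(data$sx)) {
  rows <- data$sx == s
  fitted_grouped[rows] <- fitted(lm(sl ~ poly(yr, 2), data = data[rows, ]))
}

# Compare the two sets of fitted salaries up to numerical tolerance.
same_fit <- isTRUE(all.equal(unname(fitted(mod)), fitted_grouped))
print(same_fit)
```

Because the interaction model gives each sex its own intercept, linear and quadratic term, the two approaches should describe the same pair of curves, so the geom_smooth lines can be read as the fit of sl ~ sx * poly(yr, 2).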

I was unable to find a ggplot way to do it, so here is the base plot way:

palette(c("pink", "blue"))
plot(data$yd, data$sl, col = factor(data$sx), xlab = "Years Since Earned Highest Degree", ylab = "Salary (dollars)", main = "Salary Increases with Experience", pch = 19)
legend("topleft", legend = unique(data$sx), col = c("blue", "pink"), pch=19)
# mod is the model from the question: lm(sl ~ sx * poly(yd, 2), data)
lines(seq(0,25,0.1), predict(mod, data.frame(yd = seq(0,25,0.1), sx = "female", stringsAsFactors = TRUE)),col="pink", lwd = 5)
lines(seq(0,25,0.1), predict(mod, data.frame(yd = seq(0,25,0.1), sx = "male", stringsAsFactors = TRUE)),col="blue", lwd = 5)

The two calls to lines() are the solution. If anyone has a ggplot way to do it, I'd appreciate it a lot, as ggplot looks so much better.
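One possible ggplot version of the same idea, as a sketch rather than a definitive answer: predict from the fitted model on a grid of years for each sex, then draw the predictions with geom_line. The data values below are an assumption, borrowed from the extended example in the other answer and renamed to yd to match the question; grid and p are illustrative names.

```r
library(ggplot2)

# Assumed example data (values from the other answer, column named yd).
data <- data.frame(
  sx = c("male", "male", "male", "female", "male", "male",
         "female", "female", "female"),
  yd = c(35, 22, 23, 27, 30, 21, 25, 18, 29),
  sl = c(36350, 35350, 28200, 26775, 33696, 28516, 27402, 31492, 23195)
)

# The model from the question.
mod <- lm(sl ~ sx * poly(yd, 2), data = data)

# Prediction grid: a fine sequence of years for each sex.
grid <- expand.grid(
  yd = seq(min(data$yd), max(data$yd), by = 0.1),
  sx = unique(data$sx),
  stringsAsFactors = FALSE
)
grid$sl <- predict(mod, newdata = grid)

# Points from the raw data, lines from the model predictions.
p <- ggplot(data, aes(x = yd, y = sl, color = sx)) +
  geom_point() +
  geom_line(data = grid) +
  scale_color_manual(values = c(female = "pink", male = "blue")) +
  labs(x = "Years Since Earned Highest Degree",
       y = "Salary (dollars)",
       title = "Salary Increases with Experience",
       color = "sex")
print(p)
```

The grid here only spans the observed range of yd instead of 0 to 25, so the curves are not extrapolated beyond the data; change the seq() call if you want a different range.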
