
R ggplot2 scatterplot: adding color for the level of deviation from (regression) geom_smooth line

I'm trying to create a scatterplot of two continuous variables with ggplot2, with a regression line added. In my small dataset (of yearly averages), most data points lie on or close to the regression line, while a few observations lie farther away. Would it be possible to color code the observations on the scatterplot based on their distance from the regression line?

So far I have assigned the color groups manually, but that feels too subjective. I would like something automatic, if possible.

ggplot(data_mean, aes(x = policy1, y = policy2)) +
  geom_point(aes(colour = group), size = 4) +
  geom_text_repel(aes(label = iso), hjust = 0, vjust = 0) +
  geom_smooth(method = lm, se = FALSE, size = 0.1) +
  scale_color_manual(name = "Country Categories",  # or name = element_blank()
                     values = colors) +
  theme(legend.position = "bottom",
        legend.title = element_blank())


It is hard to say in general which points are outliers; it really depends on the data you have. You can try something like the code below, where I calculate the residuals from the linear regression and flag as outliers the points whose absolute residual exceeds 2 * sd(residuals).

First, something that looks like your data, with some error added to policy2:

set.seed(888)
data_mean <- data.frame(policy1 = 1:20, policy2 = 1:20 + rnbinom(20, mu = 2, size = 2))
# absolute residuals from the linear fit
data_mean$residuals <- abs(lm(policy2 ~ policy1, data = data_mean)$residuals)
# flag as outliers the points whose absolute residual exceeds
# 2 standard deviations of the residuals column
data_mean$group <- data_mean$residuals > 2 * sd(data_mean$residuals)
data_mean$iso <- letters[1:20]

Then we plot:

library(ggplot2)
library(ggrepel)  # provides geom_text_repel()

ggplot(data_mean, aes(x = policy1, y = policy2)) +
  geom_point(aes(colour = group), size = 4) +
  geom_text_repel(aes(label = iso), hjust = 0, vjust = 0) +
  geom_smooth(method = lm, se = FALSE, size = 0.1) +
  theme(legend.position = "bottom",
        legend.title = element_blank())
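
Since the cutoff is a judgment call, one possible variation on the flag above is to use standardized residuals from base R's rstandard(), which scales each residual by its estimated standard deviation. This is only a sketch on the same kind of simulated data; the column names std_resid and outlier and the cutoff of 2 are assumptions for illustration, not part of the original answer.

```r
library(ggplot2)

# Simulated stand-in data (assumption: same shape as the example above,
# not the asker's real data).
set.seed(888)
data_mean <- data.frame(policy1 = 1:20,
                        policy2 = 1:20 + rnbinom(20, mu = 2, size = 2))

fit <- lm(policy2 ~ policy1, data = data_mean)

# Standardized residuals: each residual divided by its estimated
# standard deviation, so the values are roughly on a z-score scale.
data_mean$std_resid <- rstandard(fit)

# Hypothetical cutoff of 2; loosen or tighten it to suit the data.
data_mean$outlier <- abs(data_mean$std_resid) > 2

p_std <- ggplot(data_mean, aes(x = policy1, y = policy2)) +
  geom_point(aes(colour = outlier), size = 4) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  theme(legend.position = "bottom")
p_std
```

Because the flag is computed from the fitted model rather than typed in by hand, it updates automatically whenever the data change.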


One alternative is to use a continuous color scale:

library(viridis)  # provides scale_color_viridis()

ggplot(data_mean, aes(x = policy1, y = policy2)) +
  geom_point(aes(colour = residuals), size = 4) +
  geom_text_repel(aes(label = iso), hjust = 0, vjust = 0) +
  geom_smooth(method = lm, se = FALSE, size = 0.1) +
  theme(legend.position = "bottom",
        legend.title = element_blank()) +
  scale_color_viridis()
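
The continuous scale above colors by absolute residual, so it does not show whether a point sits above or below the line. If that direction matters, a diverging scale on the signed residuals may help. The following is a rough sketch, again on simulated data; the column name signed_resid and the color choices are assumptions.

```r
library(ggplot2)

# Simulated stand-in data (assumption: not the asker's real data).
set.seed(888)
data_mean <- data.frame(policy1 = 1:20,
                        policy2 = 1:20 + rnbinom(20, mu = 2, size = 2))

fit <- lm(policy2 ~ policy1, data = data_mean)

# Signed residuals: positive above the fitted line, negative below it.
data_mean$signed_resid <- resid(fit)

p_signed <- ggplot(data_mean, aes(x = policy1, y = policy2)) +
  geom_point(aes(colour = signed_resid), size = 4) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  # Diverging palette centered on zero, i.e. on the regression line.
  scale_colour_gradient2(low = "blue", mid = "grey80", high = "red",
                         midpoint = 0) +
  theme(legend.position = "bottom")
p_signed
```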


Again, it would be great if you could share some of the data and elaborate on how you want to color the points based on the residuals.
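
If what is wanted is a handful of automatic categories rather than a TRUE/FALSE flag or a gradient, one possibility is to bin the distance from the line with cut(). This is a sketch under stated assumptions: the bands are measured in standard deviations of the signed residuals, and the break points (1 and 2), the labels, and the colors are all hypothetical choices.

```r
library(ggplot2)

# Simulated stand-in data (assumption: not the asker's real data).
set.seed(888)
data_mean <- data.frame(policy1 = 1:20,
                        policy2 = 1:20 + rnbinom(20, mu = 2, size = 2))

fit <- lm(policy2 ~ policy1, data = data_mean)
res <- resid(fit)

# Distance from the line in units of the residual standard deviation.
z <- abs(res) / sd(res)

# Three bands: [0, 1], (1, 2], (2, Inf). Break points are illustrative.
data_mean$band <- cut(z,
                      breaks = c(0, 1, 2, Inf),
                      labels = c("within 1 SD", "1 to 2 SD", "beyond 2 SD"),
                      include.lowest = TRUE)

p_band <- ggplot(data_mean, aes(x = policy1, y = policy2)) +
  geom_point(aes(colour = band), size = 4) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  # drop = FALSE keeps empty bands in the legend.
  scale_colour_manual(values = c("within 1 SD" = "grey50",
                                 "1 to 2 SD"   = "orange",
                                 "beyond 2 SD" = "red"),
                      drop = FALSE) +
  theme(legend.position = "bottom",
        legend.title = element_blank())
p_band
```

The bands are derived from the fit, so no country has to be assigned to a group by hand.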
