
Grouped scatterplot over grouped boxplot in R using ggplot2

I am creating a grouped boxplot with a scatterplot overlay using ggplot2. I would like to group each scatterplot datapoint with the grouped boxplot that it corresponds to.

However, I'd also like the scatterplot points to be different symbols. I seem to be able to get my scatterplot points to group with my grouped boxplots OR get my scatterplot points to be different symbols... but not both simultaneously. Below is some example code to illustrate what's happening:

library(scales)
library(ggplot2) 

# Generates Data frame to plot
Gene <- c(rep("GeneA",24),rep("GeneB",24),rep("GeneC",24),rep("GeneD",24),rep("GeneE",24))
Clone <- c(rep(c("D1","D2","D3","D4","D5","D6"),20))
variable <- c(rep(c(rep("Day10",6),rep("Day20",6),rep("Day30",6),rep("Day40",6)),5))
value <- c(rnorm(24, mean = 0.5, sd = 0.5),rnorm(24, mean = 10, sd = 8),rnorm(24, mean = 1000, sd = 900), 
           rnorm(24, mean = 25000, sd = 9000), rnorm(24, mean = 8000, sd = 3000))
# Take the absolute value so every point is positive (needed for the log scale)
value <- sqrt(value*value)
Tdata <- cbind(Gene, Clone, variable)
Tdata <- data.frame(Tdata)
Tdata <- cbind(Tdata,value)

# Creates the Plot of All Data
# The below code groups the data exactly how I'd like but the scatter plot points are all the same shape
# and I'd like them to each have different shapes.                        
ln_clr <- "black"
bk_clr <- "white"
point_shapes <- c(0,15,1,16,2,17)
blue_cols <- c("#EFF2FB","#81BEF7","#0174DF","#0000FF","#0404B4")

lp1 <- ggplot(Tdata, aes(x=variable, y=value, fill=Gene)) +
    stat_boxplot(geom ='errorbar', position = position_dodge(width = .83), width = 0.25, 
                 size = 0.7, coef = 4) +
    geom_boxplot( coef=1, outlier.shape = NA, position = position_dodge(width = .83), lwd = 0.3, 
                  alpha = 1, colour = ln_clr) +
    geom_point(position = position_jitterdodge(dodge.width = 0.83), size = 1.8, alpha = 0.7, 
               pch=15)


lp1 + scale_fill_manual(values = blue_cols) + labs(y = "Fold Change") +
    expand_limits(y=c(0.01,10^5)) +
    scale_y_log10(expand = c(0, 0), breaks = c(0.01,1,100,10000,100000),
                  labels = trans_format("log10", math_format(10^.x)))

ggsave("Scatter Grouped-Wrong Symbols.png")

#*************************************************************************************************************************************
# The below code doesn't group the scatterplot data how I'd like but the points each have different shapes
lp2 <- ggplot(Tdata, aes(x=variable, y=value, fill=Gene)) +
    stat_boxplot(geom ='errorbar', position = position_dodge(width = .83), width = 0.25, 
                 size = 0.7, coef = 4) +
    geom_boxplot( coef=1, outlier.shape = NA, position = position_dodge(width = .83), lwd = 0.3, 
                  alpha = 1, colour = ln_clr) +
    geom_point(position = position_jitterdodge(dodge.width = 0.83), size = 1.8, alpha = 0.7, 
               aes(shape=Clone))


lp2 + scale_fill_manual(values = blue_cols) + labs(y = "Fold Change") +
    expand_limits(y=c(0.01,10^5)) +
    scale_y_log10(expand = c(0, 0), breaks = c(0.01,1,100,10000,100000),
                  labels = trans_format("log10", math_format(10^.x)))

ggsave("Scatter Ungrouped-Right Symbols.png")

If anyone has any suggestions I'd really appreciate it.

Thank you Nathan

To get the boxplots to appear, the shape aesthetic needs to be inside geom_point, rather than in the main call to ggplot. When the shape aesthetic is in the main ggplot call, it applies to all the geoms, including geom_boxplot. Applying a shape=Clone aesthetic therefore causes geom_boxplot to create a separate boxplot for each level of Clone. Since there's only one row of data for each combination of Gene, variable, and Clone, each group holds a single value and no meaningful boxplot is produced.

That the shape aesthetic affects geom_boxplot seems counterintuitive to me, but maybe there's a reason for it that I'm not aware of. In any case, moving the shape aesthetic into geom_point solves the problem by applying the shape aesthetic only to geom_point.
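One way to check this explanation is to count the groups ggplot2 builds for the boxplot layer. The sketch below is only an illustration, not the original poster's code: it rebuilds the example data with a fixed seed (the question sets none) and uses layer_data() to compare how many boxplots are computed when shape=Clone sits in the main ggplot() call versus inside geom_point().

```r
library(ggplot2)

# Rebuild the example data; set.seed() is an assumption added for reproducibility
set.seed(1)
Tdata <- data.frame(
  Gene     = rep(paste0("Gene", LETTERS[1:5]), each = 24),
  Clone    = rep(paste0("D", 1:6), 20),
  variable = rep(rep(paste0("Day", seq(10, 40, 10)), each = 6), 5),
  value    = abs(rnorm(120, mean = rep(c(0.5, 10, 1000, 25000, 8000), each = 24),
                       sd = rep(c(0.5, 8, 900, 9000, 3000), each = 24)))
)

# shape in the main call: every layer, including the boxplot, is split by Clone
p_global <- ggplot(Tdata, aes(x = variable, y = value, fill = Gene, shape = Clone)) +
  geom_boxplot() +
  geom_point()

# shape only in geom_point: the boxplot is grouped by variable and Gene alone
p_local <- ggplot(Tdata, aes(x = variable, y = value, fill = Gene)) +
  geom_boxplot() +
  geom_point(aes(shape = Clone))

# layer_data(plot, 1) returns one row per boxplot that was computed
n_global <- nrow(layer_data(p_global, 1))
n_local  <- nrow(layer_data(p_local, 1))
c(global = n_global, local = n_local)
```

With this data there are 4 x 5 x 6 = 120 combinations of variable, Gene, and Clone but only 4 x 5 = 20 combinations of variable and Gene, so the first count should come out as 120 single-observation groups and the second as 20.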

Then, to get the points to appear with the correct boxplot, we need to group by Gene. I also added theme_classic to make it easier to see the plot (although it's still very busy):

ggplot(Tdata, aes(x=variable, y=value, fill=Gene)) +
  stat_boxplot(geom ='errorbar', width=0.25, size=0.7, coef=4, position=position_dodge(0.85)) +
  geom_boxplot(coef=1, outlier.shape=NA, lwd=0.3, alpha=1, colour=ln_clr, position=position_dodge(0.85)) +
  geom_point(position=position_jitterdodge(dodge.width=0.85), size=1.8, alpha=0.7, 
             aes(shape=Clone, group=Gene)) +
  scale_fill_manual(values=blue_cols) + labs(y="Fold Change") +
  expand_limits(y=c(0.01,10^5)) +
  scale_y_log10(expand=c(0, 0), breaks=10^(-2:5),
                labels=trans_format("log10", math_format(10^.x))) +
  theme_classic()
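The question defines point_shapes but the code above never applies it. One possible way to use those symbols, offered here as an assumption rather than as part of the original answer, is to add scale_shape_manual(). The sketch rebuilds the data with a fixed seed so it runs by itself and drops the styling arguments for brevity. Note that shapes 0 to 20 ignore the fill aesthetic, so these points are drawn in the default colour rather than in each gene's fill colour.

```r
library(ggplot2)

# Rebuild the example data; set.seed() is an assumption added for reproducibility
set.seed(1)
Tdata <- data.frame(
  Gene     = rep(paste0("Gene", LETTERS[1:5]), each = 24),
  Clone    = rep(paste0("D", 1:6), 20),
  variable = rep(rep(paste0("Day", seq(10, 40, 10)), each = 6), 5),
  value    = abs(rnorm(120, mean = rep(c(0.5, 10, 1000, 25000, 8000), each = 24),
                       sd = rep(c(0.5, 8, 900, 9000, 3000), each = 24)))
)

point_shapes <- c(0, 15, 1, 16, 2, 17)  # shape codes taken from the question

p <- ggplot(Tdata, aes(x = variable, y = value, fill = Gene)) +
  geom_boxplot(outlier.shape = NA, position = position_dodge(0.85)) +
  geom_point(aes(shape = Clone, group = Gene),
             position = position_jitterdodge(dodge.width = 0.85),
             size = 1.8, alpha = 0.7) +
  scale_shape_manual(values = point_shapes) +  # one symbol per Clone level
  scale_y_log10()

# Shape codes actually assigned to the point layer (layer 2)
shapes_used <- sort(unique(layer_data(p, 2)$shape))
shapes_used
```

If the points should also carry the gene colour, switching to the fillable shapes 21 to 25 is one option, though that range only offers five distinct symbols for six clones.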


I think the plot would be easier to understand if you use faceting for Gene and the x-axis for variable. Putting time on the x-axis seems more intuitive, while using faceting frees up the color aesthetic for the points. With six different clones, it's still difficult (for me at least) to differentiate the point markers, but this looks cleaner to me than the previous version.

library(dplyr)

ggplot(Tdata %>% mutate(Gene=gsub("Gene","Gene ", Gene)), 
       aes(x=gsub("Day","",variable), y=value)) +
  stat_boxplot(geom='errorbar', width=0.25, size=0.7, coef=4) +
  geom_boxplot(coef=1, outlier.shape=NA, lwd=0.3, alpha=1, colour=ln_clr, width=0.5) +
  geom_point(aes(fill=Clone), position=position_jitter(0.2), size=1.5, alpha=0.7, shape=21) +
  theme_classic() +
  facet_grid(. ~ Gene) +
  labs(y = "Fold Change", x="Day") +
  expand_limits(y=c(0.01,10^5)) +
  scale_y_log10(expand=c(0, 0), breaks=10^(-2:5),
                labels=trans_format("log10", math_format(10^.x)))


If you really need to keep the points, maybe it would be better to separate the boxplots and points with some manual dodging:

set.seed(10)
ggplot(Tdata %>% mutate(Day=as.numeric(substr(variable,4,5)),
                        Gene = gsub("Gene","Gene ", Gene)), 
       aes(x=Day - 2, y=value, group=Day)) +
  stat_boxplot(geom ='errorbar', width=0.5, size=0.5, coef=4) +
  geom_boxplot(coef=1, outlier.shape=NA, lwd=0.3, alpha=1, width=4) +
  geom_point(aes(x=Day + 2, fill=Clone), size=1.5, alpha=0.7, shape=21,
             position=position_jitter(width=1, height=0)) +
  theme_classic() +
  facet_grid(. ~ Gene) +
  labs(y="Fold Change", x="Day") +
  expand_limits(y=c(0.01,10^5)) +
  scale_y_log10(expand=c(0, 0), breaks=10^(-2:5),
                labels=trans_format("log10", math_format(10^.x)))


One more thing: For future reference, you can simplify your data creation code:

Gene = rep(paste0("Gene",LETTERS[1:5]), each=24)
Clone = rep(paste0("D",1:6), 20)
variable = rep(rep(paste0("Day", seq(10,40,10)), each=6), 5)
# abs() keeps the values positive, matching the sqrt(value*value) step in the question
value = abs(rnorm(24*5, mean=rep(c(0.5,10,1000,25000,8000), each=24), 
                  sd=rep(c(0.5,8,900,9000,3000), each=24)))

Tdata = data.frame(Gene, Clone, variable, value)
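As a quick sanity check, the sketch below compares the long-hand vectors from the question with the simplified versions. The `_long` and `_short` names and the seed are introduced only for this illustration. It relies on rnorm() being vectorised over mean and sd, so under the same seed both forms should draw the same values.

```r
# Long-hand construction, as in the question
gene_long  <- c(rep("GeneA", 24), rep("GeneB", 24), rep("GeneC", 24),
                rep("GeneD", 24), rep("GeneE", 24))
clone_long <- rep(c("D1", "D2", "D3", "D4", "D5", "D6"), 20)
var_long   <- rep(c(rep("Day10", 6), rep("Day20", 6),
                    rep("Day30", 6), rep("Day40", 6)), 5)

# Simplified construction
gene_short  <- rep(paste0("Gene", LETTERS[1:5]), each = 24)
clone_short <- rep(paste0("D", 1:6), 20)
var_short   <- rep(rep(paste0("Day", seq(10, 40, 10)), each = 6), 5)

same_labels <- identical(gene_long, gene_short) &&
  identical(clone_long, clone_short) &&
  identical(var_long, var_short)

# Compare five separate rnorm() calls with one vectorised call under the same seed
# (set.seed(42) is an assumption; the question does not set a seed)
means <- c(0.5, 10, 1000, 25000, 8000)
sds   <- c(0.5, 8, 900, 9000, 3000)

set.seed(42)
value_long <- unlist(Map(function(m, s) rnorm(24, mean = m, sd = s), means, sds))

set.seed(42)
value_short <- rnorm(24 * 5, mean = rep(means, each = 24), sd = rep(sds, each = 24))

same_values <- isTRUE(all.equal(value_long, value_short))
c(labels = same_labels, values = same_values)
```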

Content on this site is licensed under CC BY-SA 4.0. Reprints should credit the site URL or the original address.

 