ggplot color, shape and size by factor variables in dataframe over several regions with legend

Question

I have the following dataframe:

structure(list(PS_position = c(54733745L, 54736536L, 54734312L, 54735312L, 54733745L, 54736536L, 54734312L, 54735312L),
           chr_key = c(19L,19L, 19L, 19L, 19L, 19L, 19L, 19L),
           hit_count = c(20L, 1L, 5L,15L, 20L, 1L, 5L, 15L),
           pconvert = c(0.448, 0.55, 0.8, 0.92, 0.448, 0.55, 0.8, 0.92),
           probe_type = c("Non_polymorphic", "preselected", "unvalidated", "validated", "Non_polymorphic", "preselected", "unvalidated", "validated"),
           region_name = c("DL1", "DL1", "DL1", "DL1", "DL2", "DL2", "DL2", "DL2"),
           start = c(54724479L, 54724479L, 54724479L, 54724479L, 54724479L, 54724479L, 54724479L, 54724479L),
           stop = c(54736536L, 54736536L, 54736536L, 54736536L, 54736536L, 54736536L, 54736536L, 54736536L)),
      row.names = c(NA, -8L), class = c("data.table",   "data.frame"))

I would like to plot PS_position for each region_name on the x-axis, coloured by probe_type, with the point shape based on pconvert categories (0.3-0.5, 0.51-0.7, 0.71-0.9, >0.9) and the point size based on hit_count. This should cover every unique region_name in the dataframe and include a legend describing these mappings. The xlim for the plot will be start / stop from the dataframe.

Somewhat like this:

Of course, the actual values will vary for each unique region_name. Any ideas on how to best achieve this? Thanks!

Edit: I had developed something in base R, which does not use hit_count or pconvert:

region = unique(df$region_name)
for(i in seq_along(region))
{
probes = df$PS_position
probe_type = factor(df$probe_type)
df$cols = as.numeric(as.factor(df$probe_type))
legend.cols = as.numeric(as.factor(levels(probe_type)))  # use the factor; df$probe_type is character, so levels() on it is NULL


#should also send the start and stop into PS_position 
cols = c("black", "blue", "green", "yellow")
#Use logarithmic scale
par(xpd = T)

plot(1, 1, ylim = c(0.5, length(probes)), xlim = c(min(probes) - 20, max(probes)+10),#, main = paste("Probes ", region, sep = ""), 
     xlab = "PS_position", bty="n", type = "n", yaxt = "n", ylab = "")

title(region[i], line=0)

begin = min(probes)
end = max(probes)
n = length(probes)

Then I plot the probes sequentially, one after another, but I don't need that anymore. I just want to plot all PS_position values at once, and they should reflect the actual start and stop and their relative position within those bounds. Note that the base R code above and below is one block; please copy and paste the two parts together.

for(i in 1:length(probes))
{
  lines(x = c(begin, end), y = c(n+1-i, n+1-i), col = "blue", lwd = .8)
  xs = probes[1:i]
  #cols_i = cols[probe_type[1:i]]
  points(x = xs, y = rep(n+1-i, length(xs)), pch = 18, cex = 1.0, col = df$cols)
  text(i, x = -50, y = n+1-i, adj = 1.5)
 
}
add_legend("topright", "Probe_Type", levels(probe_type), fill = legend.cols, horiz=T)

}

dev.off()

Trying to convert this to ggplot2

Answer 1

How about this:

I have taken your data and added the categorical pconvert_cat variable:

# comparison of the two variables:
> df[, c(4, 9)]
  pconvert pconvert_cat
1    0.448      0.3-0.5
2    0.550     0.51-0.7
3    0.800     0.71-0.9
4    0.920         >0.9
5    0.448      0.3-0.5
6    0.550     0.51-0.7
7    0.800     0.71-0.9
8    0.920         >0.9

I've tried to plot what you wanted from your question using ggplot2. Essentially, you want to facet by region_name and then map all the other variables to the aesthetics (aes) you mention in your question.

library(ggplot2)

ggplot(df, aes(x = PS_position, y = 0,
               colour = probe_type, shape = pconvert_cat, size = hit_count)) +
        geom_point() +
        scale_shape_manual(values = c(3, 15, 16, 17)) +
        coord_cartesian(xlim = c(min(df$start), max(df$stop))) +
        facet_wrap(~ region_name, nrow = 2) +
        theme_minimal() + theme(panel.grid = element_blank(),
                                axis.title.y = element_blank(),
                                axis.text.y = element_blank(),
                                axis.ticks.y = element_blank())

This is what it looks like:

Which is probably not ideal. I do not know of any geom_...() function which would simply graph the 'x difference' between points and not bother with the y-axis. SO community, can we do such a thing? Of course, this depends on whether you want any variables for the y-axis too.

Assuming you want everything on the same horizontal line, I have set y to a constant (0). Maybe you could set y = chr_key, as I notice it is constant (at least in this small data set)?

Also, setting xlim = c(min(df$start), max(df$stop)) means that all your points sit far to the right, as you can see above. Unless you specifically want this, consider dropping the line with coord_cartesian():

ggplot(df, aes(x = PS_position, y = 0,
               colour = probe_type, shape = pconvert_cat, size = hit_count)) +
        geom_point() +
        scale_shape_manual(values = c(3, 15, 16, 17)) +
        facet_wrap(~ region_name, nrow = 2) +
        theme_minimal() + theme(panel.grid = element_blank(),
                                axis.title.y = element_blank(),
                                axis.text.y = element_blank(),
                                axis.ticks.y = element_blank())

To get this:

The differences between the x-values of the points are clearer here.

Some things to consider:

Will you assign some variable to the y-axis? Will it be constant?
Will there be more than one observation for given probe_type and pconvert_cat values? If so, the colour and shape aesthetics will come more into play.
Do you need the specific x range? You want to make x differences as clear as possible.

Finally, I strongly agree with Rémi's comment that you should let us know what you've already tried. Then I don't have to be guessing quite so much in the answer.

EDIT

In reply to your comment, using facet_wrap() does not mean that the scales are fixed. You can set the scales argument to "free_x" in your case, so that each region_name can have its own start and stop values. For more information about different facet scales, look here. You might want to use geom_blank(), as is discussed on that page. You will have to decide which of the methods listed there works best for your data. Note that when you add more facets for more region_names and keep just one column of facets, they should come closer together, and the issue of having a y-scale there will become less important as there won't be so much empty space. (So, for example, if you have five different region_names, you set nrow = 5.)
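As a minimal sketch of the free-scale idea described above: the data frame below is hypothetical (the region names R1/R2 and all positions are invented so that the two regions have different start and stop values), and the geom_blank() layers are just one way to make each facet's x-axis cover its own start to stop range. Whether this suits your real data is for you to judge.

```r
library(ggplot2)

# Hypothetical data: two regions with different start/stop bounds.
# All values are invented for illustration only.
probes <- data.frame(
  PS_position = c(1200, 1800, 5100, 5900),
  region_name = c("R1", "R1", "R2", "R2"),
  hit_count   = c(20, 1, 5, 15),
  probe_type  = c("validated", "preselected", "validated", "unvalidated"),
  start       = c(1000, 1000, 5000, 5000),
  stop        = c(2000, 2000, 6000, 6000)
)

p <- ggplot(probes, aes(x = PS_position, y = 0,
                        colour = probe_type, size = hit_count)) +
  geom_point() +
  # Invisible layers: they draw nothing, but the x scale of each facet
  # is trained on them, so it extends to that region's start and stop.
  geom_blank(aes(x = start)) +
  geom_blank(aes(x = stop)) +
  # One column of facets, each with its own x-axis range.
  facet_wrap(~ region_name, ncol = 1, scales = "free_x") +
  theme_minimal() +
  theme(panel.grid   = element_blank(),
        axis.title.y = element_blank(),
        axis.text.y  = element_blank(),
        axis.ticks.y = element_blank())

print(p)
```

With scales = "free_x", each panel gets its own x-axis, so a region's points are positioned relative to that region's bounds rather than to the overall minimum and maximum.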

In summary, I think my code, with some of the facet scale changes that you can decide upon, is good to go.

Data

df <- structure(list(PS_position = c(54733745L, 54736536L, 54734312L, 54735312L, 54733745L, 54736536L, 54734312L, 54735312L),
               chr_key = c(19L,19L, 19L, 19L, 19L, 19L, 19L, 19L),
               hit_count = c(20L, 1L, 5L,15L, 20L, 1L, 5L, 15L),
               pconvert = c(0.448, 0.55, 0.8, 0.92, 0.448, 0.55, 0.8, 0.92),
               probe_type = c("Non_polymorphic", "preselected", "unvalidated", "validated", "Non_polymorphic", "preselected", "unvalidated", "validated"),
               region_name = c("DL1", "DL1", "DL1", "DL1", "DL2", "DL2", "DL2", "DL2"),
               start = c(54724479L, 54724479L, 54724479L, 54724479L, 54724479L, 54724479L, 54724479L, 54724479L),
               stop = c(54736536L, 54736536L, 54736536L, 54736536L, 54736536L, 54736536L, 54736536L, 54736536L)),
          row.names = c(NA, -8L), class = c("data.table",   "data.frame"))
df$pconvert_cat <- as.factor(ifelse(df$pconvert >= 0.3 & df$pconvert <= 0.5, "0.3-0.5",
                                    ifelse(df$pconvert > 0.5 & df$pconvert <= 0.7, "0.51-0.7",
                                           ifelse(df$pconvert > 0.7 & df$pconvert <= 0.9, "0.71-0.9", ">0.9"))))
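The nested ifelse() above could also be written with cut(), which returns a factor whose levels follow the bin order instead of relying on the sort order used by as.factor(). This is only a sketch, assuming the bins are meant to be [0.3, 0.5], (0.5, 0.7], (0.7, 0.9] and everything above 0.9. One difference to be aware of: values below 0.3 become NA here, whereas the nested ifelse() would label them ">0.9".

```r
# Same pconvert values as the first four rows of the example data.
pconvert <- c(0.448, 0.55, 0.8, 0.92)

pconvert_cat <- cut(
  pconvert,
  breaks = c(0.3, 0.5, 0.7, 0.9, Inf),
  labels = c("0.3-0.5", "0.51-0.7", "0.71-0.9", ">0.9"),
  include.lowest = TRUE,  # a value of exactly 0.3 falls into the first bin
  right = TRUE            # each bin includes its upper bound
)

# Levels come out in the order of the bins, not sorted as strings.
levels(pconvert_cat)
as.character(pconvert_cat)
```

Because scale_shape_manual(values = ...) with unnamed values assigns shapes in level order, a fixed level order keeps the first shape on the lowest bin and the last shape on the highest.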

solution1
1 ACCEPTED 2020-06-25 09:15:32