
stacked barplot converting a variable into a presence absence based percentage for unrelated variables in ggplot2 R

The following is a sample data frame:

df <- data.frame(SampleID = c(1, 2, 3, 4, 5, 6, 7, 8),
                 Var1 = c(0.1 , 0.5,    0.7,    0,  0,  0,  0.5,    0.2), 
                 Var1PA = c("Present", "Present", "Present", "Absent", "Absent", "Absent",  "Present", "Present"), 
                 Var2 = c(0, 0, 0, 0, 0.1, 0.5, 0.7, 0.2), 
                 Var2PA = c("Absent", "Absent", "Absent", "Absent", "Present", "Present", "Present", "Present"))

My question seemed simple at first, but I could not find a way to reshape the data frame so that it can be plotted as a barplot.

For Var1, I want to plot a stacked barplot showing the percentage of samples in which Var1 was present (i.e. Var1 value > 0) or absent, and similarly for Var2 and so on.

I could determine this percentage by:

(1 - sum(df$Var1 == 0) / length(df$Var1)) * 100

But how do I compute this percentage as part of the plotting? I looked at many melt options, but these variables share no unifying criterion that would give them a common x axis.

Finally, how would this work if I want to plot 5 variables from a data frame with 1000 such columns?

Edit: Thanks for the answers so far! I have a slight addition to the question: I just added one more variable, Disease, to my data frame.

df <- data.frame(SampleID = c(1, 2, 3, 4, 5, 6, 7, 8),
             Var1 = c(0.1 , 0.5,    0.7,    0,  0,  0,  0.5,    0.2), 
             Var1PA = c("Present", "Present", "Present", "Absent", "Absent", "Absent",  "Present", "Present"), 
             Var2 = c(0, 0, 0, 0, 0.1, 0.5, 0.7, 0.2), 
             Var2PA = c("Absent", "Absent", "Absent", "Absent", "Present", "Present", "Present", "Present"),
             Disease = c("Case", "Control", "Case", "Control", "Case", "Control", "Case", "Control"))

I am trying to figure out how to plot the barplot for cases and controls, with presence/absence stacked within each, for Var1PA, Var2PA and so on. If I had the right data frame as input, the ggplot2 code would be:

vars <- c('Var1PA', 'Var2PA', 'Var2PA')
## based on the first comment by @rawr
tt <- data.frame(prop.table(as.table(sapply(df[, vars], table)), 2) * 100)
ggplot(tt, aes(Disease, Freq)) +
    geom_bar(aes(fill = Var1), position = "stack", stat = "identity") +
    facet_grid(~vars)

How do I get percentages for cases (present and absent) and controls (present and absent) for each of the vars? Thanks!

This should generalize nicely. You can, of course, be more selective about the variables you pick.

library(dplyr)
library(tidyr)
library(ggplot2)  # needed for the plot below
mdf = df %>% select(SampleID, ends_with("PA")) %>%
    gather(key = Var, value = PA, -SampleID) %>%
    mutate(PA = factor(PA, levels = c("Present", "Absent")))

ggplot(mdf, aes(x = Var, fill = PA)) +
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent)
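
If only the numeric columns are available, or you want just a handful of columns out of many, a similar pipeline can start from a vector of column names and derive presence from value > 0. This is a minimal sketch, not part of the original answer: the names `vars`, `mdf_sel` and `p_sel` are made up for illustration, and it uses `pivot_longer()`, the newer tidyr replacement for `gather()`.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Sample data from the question, numeric columns only
df <- data.frame(
  SampleID = 1:8,
  Var1 = c(0.1, 0.5, 0.7, 0, 0, 0, 0.5, 0.2),
  Var2 = c(0, 0, 0, 0, 0.1, 0.5, 0.7, 0.2)
)

# Hypothetical selection: list the columns you want to plot
vars <- c("Var1", "Var2")

mdf_sel <- df %>%
  select(SampleID, all_of(vars)) %>%
  pivot_longer(cols = all_of(vars), names_to = "Var", values_to = "value") %>%
  # A value above zero counts as present, as defined in the question
  mutate(PA = factor(ifelse(value > 0, "Present", "Absent"),
                     levels = c("Present", "Absent")))

p_sel <- ggplot(mdf_sel, aes(x = Var, fill = PA)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent)
print(p_sel)
```

With 1000 columns, only `vars` needs to change; the rest of the pipeline stays the same.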


You can add the percentage columns to the long data frame:

mdf %>% group_by(Var) %>%
    mutate(p_present = mean(PA == "Present"),
           p_absent = mean(PA == "Absent"))
# Source: local data frame [16 x 5]
# Groups: Var [2]
# 
#    SampleID    Var      PA p_present p_absent
#       <dbl>  <chr>  <fctr>     <dbl>    <dbl>
# 1         1 Var1PA Present     0.625    0.375
# 2         2 Var1PA Present     0.625    0.375
# 3         3 Var1PA Present     0.625    0.375
# 4         4 Var1PA  Absent     0.625    0.375
# 5         5 Var1PA  Absent     0.625    0.375
# 6         6 Var1PA  Absent     0.625    0.375
# 7         7 Var1PA Present     0.625    0.375
# 8         8 Var1PA Present     0.625    0.375
# 9         1 Var2PA  Absent     0.500    0.500
# 10        2 Var2PA  Absent     0.500    0.500

Or, if you'd rather see a one-row-per-group summary, replace mutate with summarize:

mdf %>% group_by(Var) %>%
    summarize(p_present = mean(PA == "Present"),
              p_absent = mean(PA == "Absent"))
# # A tibble: 2 × 3
#      Var p_present p_absent
#    <chr>     <dbl>    <dbl>
# 1 Var1PA     0.625    0.375
# 2 Var2PA     0.500    0.500
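
For the edited question (cases vs. controls), one possible approach is to keep Disease as an identifier column while reshaping, put it on the x axis, and facet by variable. This is a sketch under the assumption that the data frame looks like the edited sample; `mdf_d`, `p_d` and `pct_d` are illustrative names, and `pivot_longer()` is used in place of `gather()`.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Edited sample data from the question (presence/absence columns plus Disease)
df <- data.frame(
  SampleID = 1:8,
  Var1PA = c("Present", "Present", "Present", "Absent",
             "Absent", "Absent", "Present", "Present"),
  Var2PA = c("Absent", "Absent", "Absent", "Absent",
             "Present", "Present", "Present", "Present"),
  Disease = c("Case", "Control", "Case", "Control",
              "Case", "Control", "Case", "Control")
)

# Long format: one row per sample and variable; SampleID and Disease are kept
mdf_d <- df %>%
  pivot_longer(cols = ends_with("PA"), names_to = "Var", values_to = "PA") %>%
  mutate(PA = factor(PA, levels = c("Present", "Absent")))

# One panel per variable, one stacked bar per disease group
p_d <- ggplot(mdf_d, aes(x = Disease, fill = PA)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  facet_grid(~ Var)
print(p_d)

# The same percentages as a table
pct_d <- mdf_d %>%
  count(Var, Disease, PA) %>%
  group_by(Var, Disease) %>%
  mutate(pct = 100 * n / sum(n)) %>%
  ungroup()
pct_d
```

With the sample data, `pct_d` reports 75% Present among cases and 50% among controls for Var1PA, and 50% Present in both groups for Var2PA.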

My solution for this (the native pipe |> requires R 4.1 or later):

library(ggplot2)
library(reshape)
library(dplyr)

df <- data.frame(
  SampleID = c(1, 2, 3, 4, 5, 6, 7, 8),
  Var1 = c(0.1, 0.5, 0.7, 0, 0, 0, 0.5, 0.2),
  Var1PA = c("Present", "Present", "Present", "Absent", "Absent", "Absent", "Present", "Present"),
  Var2 = c(0, 0, 0, 0, 0.1, 0.5, 0.7, 0.2),
  Var2PA = c("Absent", "Absent", "Absent", "Absent", "Present", "Present", "Present", "Present")
)

# Reshape to long format with SampleID as the identifier
reshape::melt(df, c('SampleID')) |>
  # Keep only the numeric variables and label zero values as absent
  filter(variable == 'Var1' | variable == 'Var2') |>
  mutate(value1 = ifelse(value == 0, 'Absent', 'Present')) |>
  # Count present/absent per variable and convert counts to proportions
  group_by(variable) |> count(variable, value1) |>
  mutate(
    prc = n/sum(n)
  ) |> as.data.frame() |>
  ggplot(aes(x = variable, y = prc, fill = value1)) +
    geom_bar(stat = 'identity', position = 'fill', width = 0.7) +
    scale_y_continuous(labels = scales::percent) +
    labs(fill = 'Presence status') +
    # Label each segment with its share; stat() is deprecated in recent ggplot2
    geom_text(aes(label = scales::percent(prc)),
              position = position_fill(vjust = 0.5))

