
Stacked barplot converting a variable into a presence/absence-based percentage for unrelated variables in ggplot2 (R)

The following is a sample data frame:

df <- data.frame(SampleID = c(1, 2, 3, 4, 5, 6, 7, 8),
                 Var1 = c(0.1 , 0.5,    0.7,    0,  0,  0,  0.5,    0.2), 
                 Var1PA = c("Present", "Present", "Present", "Absent", "Absent", "Absent",  "Present", "Present"), 
                 Var2 = c(0, 0, 0, 0, 0.1, 0.5, 0.7, 0.2), 
                 Var2PA = c("Absent", "Absent", "Absent", "Absent", "Present", "Present", "Present", "Present"))

My question started off as seemingly simple, but I could not find a way to reshape the data frame suitably for a barplot.

For Var1, I want to plot a stacked barplot of the percentage of samples in which Var1 was present (i.e. Var1 value > 0) or absent, and similarly for Var2 and so on.

I could determine this percentage by:

(1 - sum(df$Var1 == 0) / length(df$Var1)) * 100
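
For reference, the same figure can be written as the mean of a logical vector, which extends to several columns at once. This is only a minimal sketch, assuming the numeric columns are named Var1 and Var2 as in the sample data; `value_cols` and `pct_present` are illustrative names, not part of the original post.

```r
# Sample data restated so the snippet runs on its own
df <- data.frame(Var1 = c(0.1, 0.5, 0.7, 0, 0, 0, 0.5, 0.2),
                 Var2 = c(0, 0, 0, 0, 0.1, 0.5, 0.7, 0.2))

# Hypothetical vector naming the numeric columns to summarise
value_cols <- c("Var1", "Var2")

# df[value_cols] > 0 is a logical matrix; colMeans() gives the share of TRUE
# per column, i.e. the fraction of samples where the variable is present
pct_present <- colMeans(df[value_cols] > 0) * 100
pct_present
```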

But how do I convert this into a percentage while plotting? I looked at many melt options, but there is no unifying criterion for these variables that would make a common x axis.

Finally, how does one answer the question above if I want to plot 5 variables from a data frame of 1000 such column variables?

Edit: Thanks for the answers so far! I have a slight edit to the question: I just added one more variable to my data frame.

df <- data.frame(SampleID = c(1, 2, 3, 4, 5, 6, 7, 8),
             Var1 = c(0.1 , 0.5,    0.7,    0,  0,  0,  0.5,    0.2), 
             Var1PA = c("Present", "Present", "Present", "Absent", "Absent", "Absent",  "Present", "Present"), 
             Var2 = c(0, 0, 0, 0, 0.1, 0.5, 0.7, 0.2), 
             Var2PA = c("Absent", "Absent", "Absent", "Absent", "Present", "Present", "Present", "Present"),
             Disease = c("Case", "Control", "Case", "Control", "Case", "Control", "Case", "Control"))

I am trying to figure out how to plot the barplot for cases and controls, with presence and absence stacked within them, for Var1PA, Var2PA and so on. If I have the right data frame input, the ggplot2 code would be:

vars <- c('Var1PA', 'Var2PA', 'Var2PA') ## based on the first comment by @rawr
tt <- data.frame(prop.table(as.table(sapply(df[, vars], table)), 2) * 100)
ggplot(tt, aes(Disease, Freq)) +
    geom_bar(aes(fill = Var1), position = "stack", stat = "identity") + facet_grid(~vars)

How do I get percentages for cases (present and absent) and controls (present and absent) for each of the vars? Thanks!

This should generalize nicely. You can, of course, be more selective about the variables you pick.

library(dplyr)
library(tidyr)
library(ggplot2)  # needed for the ggplot() call below
mdf = df %>% select(SampleID, ends_with("PA")) %>%
    gather(key = Var, value = PA, -SampleID) %>%
    mutate(PA = factor(PA, levels = c("Present", "Absent")))

ggplot(mdf, aes(x = Var, fill = PA)) +
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent)

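
To plot only a handful of columns out of many, the `select()` step can take an explicit vector of names instead of `ends_with("PA")`. The following is a sketch under assumptions: `keep` is a hypothetical vector listing the wanted columns, and `pivot_longer()` (the newer tidyr replacement for `gather()`) is swapped in for the reshape.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Sample data restated so the snippet runs on its own
df <- data.frame(
  SampleID = 1:8,
  Var1PA = c("Present", "Present", "Present", "Absent",
             "Absent", "Absent", "Present", "Present"),
  Var2PA = c("Absent", "Absent", "Absent", "Absent",
             "Present", "Present", "Present", "Present")
)

# Hypothetical: the few presence/absence columns you want out of the full set
keep <- c("Var1PA", "Var2PA")

# Keep only those columns, then reshape to one row per sample and variable
mdf_sel <- df %>%
  select(SampleID, all_of(keep)) %>%
  pivot_longer(cols = all_of(keep), names_to = "Var", values_to = "PA") %>%
  mutate(PA = factor(PA, levels = c("Present", "Absent")))

# position = "fill" rescales each bar to 100%
p_sel <- ggplot(mdf_sel, aes(x = Var, fill = PA)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent)
p_sel
```

Using `all_of()` makes the selection fail loudly if a name in `keep` is misspelled, which helps when the data frame has on the order of 1000 columns.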

You can add the percentage columns to the long data frame:

mdf %>% group_by(Var) %>%
    mutate(p_present = mean(PA == "Present"),
           p_absent = mean(PA == "Absent"))
# Source: local data frame [16 x 5]
# Groups: Var [2]
# 
#    SampleID    Var      PA p_present p_absent
#       <dbl>  <chr>  <fctr>     <dbl>    <dbl>
# 1         1 Var1PA Present     0.625    0.375
# 2         2 Var1PA Present     0.625    0.375
# 3         3 Var1PA Present     0.625    0.375
# 4         4 Var1PA  Absent     0.625    0.375
# 5         5 Var1PA  Absent     0.625    0.375
# 6         6 Var1PA  Absent     0.625    0.375
# 7         7 Var1PA Present     0.625    0.375
# 8         8 Var1PA Present     0.625    0.375
# 9         1 Var2PA  Absent     0.500    0.500
# 10        2 Var2PA  Absent     0.500    0.500

Or if you'd rather see a one-line-per-group summary, replace mutate with summarize:

mdf %>% group_by(Var) %>%
    summarize(p_present = mean(PA == "Present"),
           p_absent = mean(PA == "Absent"))
# # A tibble: 2 × 3
#      Var p_present p_absent
#    <chr>     <dbl>    <dbl>
# 1 Var1PA     0.625    0.375
# 2 Var2PA     0.500    0.500
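
For the follow-up about cases and controls, one option is to carry `Disease` through the reshape and facet by variable. This is a sketch rather than part of the original answer: it assumes the `Disease` column from the edited question, uses `pivot_longer()` in place of `gather()`, and the names `mdf_d`, `p_d` and `pct_d` are illustrative.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Edited sample data from the question, restated so the snippet runs on its own
df <- data.frame(
  SampleID = 1:8,
  Var1PA = c("Present", "Present", "Present", "Absent",
             "Absent", "Absent", "Present", "Present"),
  Var2PA = c("Absent", "Absent", "Absent", "Absent",
             "Present", "Present", "Present", "Present"),
  Disease = c("Case", "Control", "Case", "Control",
              "Case", "Control", "Case", "Control")
)

# Keep Disease as an id column while reshaping the PA columns to long form
mdf_d <- df %>%
  select(SampleID, Disease, ends_with("PA")) %>%
  pivot_longer(cols = ends_with("PA"), names_to = "Var", values_to = "PA") %>%
  mutate(PA = factor(PA, levels = c("Present", "Absent")))

# One panel per variable, one 100% bar per disease group
p_d <- ggplot(mdf_d, aes(x = Disease, fill = PA)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  facet_grid(~ Var)
p_d

# The same percentages as a table, one row per variable and disease group
pct_d <- mdf_d %>%
  group_by(Var, Disease) %>%
  summarize(p_present = mean(PA == "Present"),
            p_absent = mean(PA == "Absent"),
            .groups = "drop")
pct_d
```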

My solution for this:

library(ggplot2)
library(reshape)
library(dplyr)

df <- data.frame(
  SampleID = c(1, 2, 3, 4, 5, 6, 7, 8),
  Var1 = c(0.1, 0.5, 0.7, 0, 0, 0, 0.5, 0.2),
  Var1PA = c("Present", "Present", "Present", "Absent", "Absent", "Absent", "Present", "Present"),
  Var2 = c(0, 0, 0, 0, 0.1, 0.5, 0.7, 0.2),
  Var2PA = c("Absent", "Absent", "Absent", "Absent", "Present", "Present", "Present", "Present")
)

reshape::melt(df, c('SampleID')) |> 
  filter(variable == 'Var1' | variable == 'Var2') |> 
  mutate(value1 = ifelse(value == 0, 'Absent', 'Present')) |> 
  group_by(variable) |> count(variable, value1) |> 
  mutate(
    prc = n/sum(n)
  ) |>  as.data.frame() |> 
  ggplot( aes(x = variable, y = prc, fill = value1)) +
    geom_bar(stat = 'identity', position = 'fill', width = 0.7) +
    scale_y_continuous(labels = scales::percent) +
    labs(fill = 'Presence status') +
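    # label each segment with its share; prc already sums to 1 within each variable
    # (replaces the deprecated stat(y), x and y are inherited from ggplot())
    geom_text(aes(label = scales::percent(prc)),
              position = position_fill(vjust = 0.5))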

