
A tidy way to order x-axis in stacked bar by subset of fill

I have a dataframe GTs.df of genotypes for 8 genes across 8 different genetic lines. "NA"s represent ambiguous sequencing calls. (There are few heterozygotes "Aa" because these are inbred lines.)

GTs.df <- data.frame(Gene = rep(c("Zm1","Zm2","Zm3","Zm4","Zm5","Zm6","Zm7","Zm8"), each=8),
  Line = rep(c("L1", "L2", "L3", "L4", "L5", "L6", "L7", "L8"), times = 8),
  Genotype = c(rep(c("aa", "Aa", "AA", "NA"), times = c(2, 1, 5, 0)),
               rep(c("aa", "Aa", "AA", "NA"), times = c(4, 0, 1, 3)),
               rep(c("aa", "Aa", "AA", "NA"), times = c(4, 1, 3, 0)),
               rep(c("aa", "Aa", "AA", "NA"), times = c(3, 0, 4, 1)),
               rep(c("aa", "Aa", "AA", "NA"), times = c(4, 0, 3, 1)),
               rep(c("aa", "Aa", "AA", "NA"), times = c(5, 1, 2, 0)),
               rep(c("aa", "Aa", "AA", "NA"), times = c(1, 0, 3, 4)),
               rep(c("aa", "Aa", "AA", "NA"), times = c(1, 1, 6, 0))
               )
  )

I want to compare the distribution of genotypes across the lines for each gene, so I make this stacked bar plot initially:

library(tidyverse)  # loads dplyr, forcats and ggplot2, used in every snippet here

GTs.df %>%
  filter(Genotype != "NA") %>%
  mutate(Genotype = fct_relevel(Genotype, 
                                  c("AA", "Aa", "aa"))) %>%
  ggplot() +
  aes(x = Gene,
      fill = Genotype) +
  geom_bar(position = "stack",
           stat = "count") + 
  ylab("Number of Lines")


But the problem is that I want the Genes/columns ordered by number of "aa" so that the plot is more readable. I can reorder the Genes via fct_reorder as suggested by tbradley / 48748250/FilipW and demonstrated below...

GTs.df %>%
  filter(Genotype != "NA") %>%
  mutate(Genotype = fct_relevel(Genotype, 
                                  c("AA", "Aa", "aa")),
         Gene = fct_reorder(Gene,
                            as.numeric(Genotype),
                            .fun = mean)
         ) %>%
  ggplot() +
  aes(x = Gene,
      fill = Genotype) +
  geom_bar(position = "stack",
           stat = "count") + 
  ylab("Number of Lines")


As you can see, this orders the Genes/columns pretty well by sorting on proportion, but it is imperfect in this case because of missing data points and more than 2 levels. You can see the last Gene (Zm2) has fewer "aa" lines than the Gene before it but does have a higher proportion/mean of "aa".

I also tried a variation of this using sum instead of mean.

GTs.df %>%
  filter(Genotype != "NA") %>%
  mutate(Genotype = fct_relevel(Genotype, 
                                  c("AA", "Aa", "aa")),
         Gene = fct_reorder(Gene,
                            as.numeric(Genotype),
                            .fun = sum)
         ) %>%
  ggplot() +
  aes(x = Gene,
      fill = Genotype) +
  geom_bar(position = "stack",
           stat = "count") + 
  ylab("Number of Lines")


It also almost works, but is still imperfect. Gene Zm4 has fewer "aa"s than the column before it, I guess because Zm4 has more total datapoints to contribute to the sum.
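To make the two failure modes concrete, here is a small sketch that tabulates, per gene, the "aa" count next to the mean and sum of the numeric genotype codes that fct_reorder is sorting on. The names counts, metrics, code, mean_code and sum_code are made up for this illustration, and the data are rebuilt compactly from the per-gene counts given above.

```r
library(dplyr)

# Rebuild the example data (per-gene counts of aa, Aa, AA, NA as in the question).
counts <- list(c(2, 1, 5, 0), c(4, 0, 1, 3), c(4, 1, 3, 0), c(3, 0, 4, 1),
               c(4, 0, 3, 1), c(5, 1, 2, 0), c(1, 0, 3, 4), c(1, 1, 6, 0))
GTs.df <- data.frame(
  Gene = rep(paste0("Zm", 1:8), each = 8),
  Line = rep(paste0("L", 1:8), times = 8),
  Genotype = unlist(lapply(counts, function(n) {
    rep(c("aa", "Aa", "AA", "NA"), times = n)
  }))
)

metrics <- GTs.df %>%
  filter(Genotype != "NA") %>%
  # Same coding that as.numeric() yields after fct_relevel: AA = 1, Aa = 2, aa = 3
  mutate(code = as.numeric(factor(Genotype, levels = c("AA", "Aa", "aa")))) %>%
  group_by(Gene) %>%
  summarise(num_aa    = sum(Genotype == "aa"),  # the quantity I actually want to sort by
            mean_code = mean(code),             # what .fun = mean sorts by
            sum_code  = sum(code),              # what .fun = sum sorts by
            .groups = "drop")

print(metrics)
```

In this table Zm2 has the highest mean_code despite having fewer "aa" than Zm6, and Zm4 ties Zm2 on sum_code despite having one fewer "aa", which matches what the two plots show.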

Ideally, I would want to use some sort of count function instead, but neither n nor count works for me, no matter what class I change Genotype to. (There were many combinations, so I spared you the long, depressing list of error messages.)
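For reference, one way a count could be passed to fct_reorder is as a small anonymous function that counts the "aa" entries within each gene, since .fun receives the vector of Genotype values for one Gene at a time. This is only a sketch under that assumption, not a tested general solution; reordered is a hypothetical name, and the data are rebuilt compactly from the counts above.

```r
library(dplyr)
library(forcats)

# Rebuild the example data (per-gene counts of aa, Aa, AA, NA as in the question).
counts <- list(c(2, 1, 5, 0), c(4, 0, 1, 3), c(4, 1, 3, 0), c(3, 0, 4, 1),
               c(4, 0, 3, 1), c(5, 1, 2, 0), c(1, 0, 3, 4), c(1, 1, 6, 0))
GTs.df <- data.frame(
  Gene = rep(paste0("Zm", 1:8), each = 8),
  Line = rep(paste0("L", 1:8), times = 8),
  Genotype = unlist(lapply(counts, function(n) {
    rep(c("aa", "Aa", "AA", "NA"), times = n)
  }))
)

reordered <- GTs.df %>%
  filter(Genotype != "NA") %>%
  # .fun is applied to the Genotype values of each Gene and must return one number
  mutate(Gene = fct_reorder(Gene, Genotype,
                            .fun = function(g) sum(g == "aa")))

levels(reordered$Gene)  # genes in ascending order of their "aa" count
```

Genes with the same "aa" count keep their original relative order here, so this sketch does not provide a tie-breaker on its own.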

I did find a non-tidy solution from 48748250/talat that arranges the columns by count/absolute frequency of "aa" as desired:

gene_lvls <- names(sort(table(GTs.df[GTs.df$Genotype == "aa", "Gene"])))

GTs.df %>%
  filter(Genotype != "NA") %>%
  mutate(Genotype = fct_relevel(Genotype, 
                                  c("AA", "Aa", "aa"))) %>%
  ggplot() +
  aes(x = factor(Gene, 
                 levels = gene_lvls),
      fill = Genotype) +
  geom_bar(position = "stack",
           stat = "count") + 
  ylab("Number of Lines")


But I am hoping there's a tidy/dplyr/forcats-friendly way to achieve this, partly for learning/understanding and partly for pickiness/aesthetic pleasure. Based on the number of similar forum questions, I have a feeling other people would be pleased by such a solution too. Bonus points if the solution has a secondary filter/tie-breaker for when multiple columns have an equal number of "aa", as demonstrated by Zm2, Zm3 and Zm5 in the above plot.

Thank you in advance for your time and effort!

Here are some other forum pages that are somewhat related:

R ggplot2 Reorder stacked plot?

How to control ordering of stacked bar chart using identity on ggplot2

sort columns with categorical variables by numerical varables in stacked barplot

Here's one approach using fct_inorder after calculating some Gene-wise metrics, like the number of "aa" and the total number of lines. This provides a pretty flexible way of creating whatever sorting metric you want, which could involve multiple tie-breakers.

GTs.df %>%
  filter(Genotype != "NA") %>%
  mutate(Genotype = fct_relevel(Genotype, 
                                c("AA", "Aa", "aa"))) %>%
  group_by(Gene) %>%
  mutate(num_aa = sum(Genotype == "aa"),
         ttl_lines = n()) %>%
  ungroup() %>%
  arrange(num_aa, ttl_lines) %>%        # Define your tie-breakers here
  mutate(Gene = fct_inorder(Gene)) %>%  # Assign factor in order of appearance 
  
  ggplot() +
  aes(x = Gene,
      fill = Genotype) +
  geom_bar(position = "stack",
           stat = "count") + 
  ylab("Number of Lines")
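As a variation on the same idea, the per-gene metrics can be collapsed to one row per Gene with summarise, sorted with whatever tie-breakers are wanted, and the resulting order used as the factor levels. The sketch below breaks ties on the "aa" count by putting genes with more "AA" first; that tie-breaker, and the names gene_order, num_AA and plot_df, are assumptions chosen for illustration. The data are rebuilt compactly from the counts in the question.

```r
library(dplyr)

# Rebuild the example data (per-gene counts of aa, Aa, AA, NA as in the question).
counts <- list(c(2, 1, 5, 0), c(4, 0, 1, 3), c(4, 1, 3, 0), c(3, 0, 4, 1),
               c(4, 0, 3, 1), c(5, 1, 2, 0), c(1, 0, 3, 4), c(1, 1, 6, 0))
GTs.df <- data.frame(
  Gene = rep(paste0("Zm", 1:8), each = 8),
  Line = rep(paste0("L", 1:8), times = 8),
  Genotype = unlist(lapply(counts, function(n) {
    rep(c("aa", "Aa", "AA", "NA"), times = n)
  }))
)

called <- GTs.df %>% filter(Genotype != "NA")

# One row per gene, sorted by "aa" count, then by "AA" count (descending) as tie-breaker
gene_order <- called %>%
  group_by(Gene) %>%
  summarise(num_aa = sum(Genotype == "aa"),
            num_AA = sum(Genotype == "AA"),
            .groups = "drop") %>%
  arrange(num_aa, desc(num_AA)) %>%
  pull(Gene)

# Use that order as the factor levels; plot_df can then go to ggplot() as before
plot_df <- called %>%
  mutate(Gene = factor(Gene, levels = gene_order))

levels(plot_df$Gene)
```

Keeping the ordering in a separate vector makes it easy to inspect the sort before plotting, at the cost of one extra intermediate object compared with the single pipeline above.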

