
How to count which rows are meeting conditions in R?

I have a dataset of groups of genes that I filter to select the best gene per group with some conditions:

Data:

 Group Gene      Score     direct_count   secondary_count
    1   AQP11    0.5566507       4               5
    1   CLNS1A   0.2811747       0               2
    1   RSF1     0.5269924       3               6
    2   CFDP1    0.4186066       1               2
    2   CHST6    0.4295135       1               3
    3   ACE      0.634           1               1
    3   NOS2     0.6345          1               1
    4   Gene1    0.7             0               0
    4   Gene2    0.62            1               0
    4   Gene3    0.61            0               1

Filtering:

dt %>% 
  group_by(Group) %>% 
  filter((max(Score) - Score)<0.05) %>% 
  slice_max(direct_count, n = 1) %>% 
  slice_max(secondary_count, n = 1) %>% 
  ungroup()

I am looking to be able to count how many genes are being filtered at which step in the above code.

So for example the conditions I am applying with this code are:

  1. Select the gene with the highest score if the score difference between the top scored gene and all others in the group is >0.05

  2. If the score difference between the top gene and any other genes in a group is <0.05, then select the gene with the higher direct_count, choosing only between those genes within <0.05 of the top scored gene in the group

  3. If the direct_count is the same, select the gene with the highest secondary_count

  4. If all counts are the same, select all genes within <0.05 of each other.

I've been able to count the genes meeting my first condition (>0.05 score) doing:

new_df <- dt %>% 
  group_by(Group) %>% 
  filter((max(Score) - Score)<0.05)

count1 <- new_df[!(duplicated(new_df$Group) | duplicated(new_df$Group, fromLast = TRUE)), ] 

I've been trying to apply similar rules to count how many genes meet the conditions for higher direct_count, higher secondary_count, or matching direct_count and secondary_count, but the different code I try gives different numbers, so I'm not sure what the best way is.

Input data:

#Input data before filtering with code above:

dt <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L), Gene = c("AQP11", 
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2", "Gene1","Gene2","Gene3"), Score = c(0.5566507, 
0.2811747, 0.5269924, 0.4186066, 0.4295135, 0.634, 0.6345, 0.7, 0.62, 0.61), direct_count = c(4L, 
0L, 3L, 1L, 1L, 1L, 1L, 0L, 1L, 0L), secondary_count = c(5L, 2L, 6L, 2L, 
3L, 1L, 1L, 0L, 0L, 1L)), row.names = c(NA, -10L), class = c("data.table", 
"data.frame"))

#Input data after filtering with code applied above:

structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 
4L), Gene = c("AQP11", "CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", 
"NOS2", "Gene1", "Gene2", "Gene3"), Score = c(0.5566507, 0.2811747, 
0.5269924, 0.4186066, 0.4295135, 0.634, 0.6345, 0.7, 0.62, 0.61
), direct_count = c(4L, 0L, 3L, 1L, 1L, 1L, 1L, 0L, 1L, 0L), 
    secondary_count = c(5L, 2L, 6L, 2L, 3L, 1L, 1L, 0L, 0L, 1L
    )), row.names = c(NA, -10L), class = c("data.table", "data.frame"
))

This example data I've made should have 1 gene group selected by >0.05 score, 1 gene group filtered by larger direct_count and 2 gene groups filtered by secondary_count. Ideally I am aiming to count the number of groups and be able to pull them out of the dataset.

Output from the example would just be a count like:

Genes filtered by >0.05 score: 1
Genes filtered by direct_count: 1
Genes filtered by secondary_count: 2

Count

Basically, before and after each filter, you calculate the number of rows available and save it in a new column.

library(dplyr)
dt %>% 
  group_by(Group) %>% 
  mutate(filter0 = n()) %>% 
  filter((max(Score) - Score)<0.05) %>% 
  mutate(filter1 = n()) %>% 
  slice_max(direct_count, n = 1) %>% 
  mutate(filter2 = n()) %>% 
  slice_max(secondary_count, n = 1) %>% 
  mutate(filter3 = n()) %>% 
  ungroup()

#> # A tibble: 5 x 9
#>   Group Gene  Score direct_count secondary_count filter0 filter1 filter2 filter3
#>   <int> <chr> <dbl>        <int>           <int>   <int>   <int>   <int>   <int>
#> 1     1 AQP11 0.557            4               5       3       2       1       1
#> 2     2 CHST6 0.430            1               3       2       2       2       1
#> 3     3 ACE   0.634            1               1       2       2       2       2
#> 4     3 NOS2  0.634            1               1       2       2       2       2
#> 5     4 Gene1 0.7              0               0       3       1       1       1
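
As a rough sketch building on the counts above, the difference between consecutive counts gives the number of genes each step dropped per group. The `removed_*` column names are my own, and `dt` is rebuilt here from the `dput()` in the question (as a plain data frame) so the block runs on its own.

```r
library(dplyr)

# Example data rebuilt from the dput() in the question
dt <- data.frame(
  Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L),
  Gene = c("AQP11", "CLNS1A", "RSF1", "CFDP1", "CHST6",
           "ACE", "NOS2", "Gene1", "Gene2", "Gene3"),
  Score = c(0.5566507, 0.2811747, 0.5269924, 0.4186066, 0.4295135,
            0.634, 0.6345, 0.7, 0.62, 0.61),
  direct_count = c(4L, 0L, 3L, 1L, 1L, 1L, 1L, 0L, 1L, 0L),
  secondary_count = c(5L, 2L, 6L, 2L, 3L, 1L, 1L, 0L, 0L, 1L)
)

removed <- dt %>%
  group_by(Group) %>%
  mutate(filter0 = n()) %>%
  filter((max(Score) - Score) < 0.05) %>%
  mutate(filter1 = n()) %>%
  slice_max(direct_count, n = 1) %>%
  mutate(filter2 = n()) %>%
  slice_max(secondary_count, n = 1) %>%
  mutate(filter3 = n()) %>%
  ungroup() %>%
  # the counts are constant within a group, so one row per group is enough
  distinct(Group, filter0, filter1, filter2, filter3) %>%
  # hypothetical column names: genes dropped by each step
  mutate(
    removed_by_score     = filter0 - filter1,
    removed_by_direct    = filter1 - filter2,
    removed_by_secondary = filter2 - filter3
  )

removed
```

A group where all three `removed_*` values are 0 is one where every gene survived all the filters.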

Show filters explicitly

Or you can keep track of the filters in this way. Each column shows whether the row was selected at each filtering step.

library(dplyr)
dt %>% 
  group_by(Group) %>% 
  mutate(filter1 = (max(Score) - Score)<0.05) %>% 
  mutate(filter2 = rank(-replace(direct_count, !filter1, -Inf), ties.method = "min") == 1) %>% 
  mutate(filter3 = rank(-replace(secondary_count, !filter2, -Inf), ties.method = "min") == 1) %>% 
  ungroup()

#> # A tibble: 10 x 8
#>    Group Gene   Score direct_count secondary_count filter1 filter2 filter3
#>    <int> <chr>  <dbl>        <int>           <int> <lgl>   <lgl>   <lgl>  
#>  1     1 AQP11  0.557            4               5 TRUE    TRUE    TRUE   
#>  2     1 CLNS1A 0.281            0               2 FALSE   FALSE   FALSE  
#>  3     1 RSF1   0.527            3               6 TRUE    FALSE   FALSE  
#>  4     2 CFDP1  0.419            1               2 TRUE    TRUE    FALSE  
#>  5     2 CHST6  0.430            1               3 TRUE    TRUE    TRUE   
#>  6     3 ACE    0.634            1               1 TRUE    TRUE    TRUE   
#>  7     3 NOS2   0.634            1               1 TRUE    TRUE    TRUE   
#>  8     4 Gene1  0.7              0               0 TRUE    TRUE    TRUE   
#>  9     4 Gene2  0.62             1               0 FALSE   FALSE   FALSE  
#> 10     4 Gene3  0.61             0               1 FALSE   FALSE   FALSE  

If you filter by the last column (filter3) you actually get the same output as the dplyr pipe you shared in your question.

library(dplyr)
dt %>% 
  group_by(Group) %>% 
  mutate(filter1 = (max(Score) - Score)<0.05) %>% 
  mutate(filter2 = rank(-replace(direct_count, !filter1, -Inf), ties.method = "min") == 1) %>% 
  mutate(filter3 = rank(-replace(secondary_count, !filter2, -Inf), ties.method = "min") == 1) %>% 
  ungroup() %>%
  filter(filter3)
#> # A tibble: 5 x 8
#>   Group Gene  Score direct_count secondary_count filter1 filter2 filter3
#>   <int> <chr> <dbl>        <int>           <int> <lgl>   <lgl>   <lgl>  
#> 1     1 AQP11 0.557            4               5 TRUE    TRUE    TRUE   
#> 2     2 CHST6 0.430            1               3 TRUE    TRUE    TRUE   
#> 3     3 ACE   0.634            1               1 TRUE    TRUE    TRUE   
#> 4     3 NOS2  0.634            1               1 TRUE    TRUE    TRUE   
#> 5     4 Gene1 0.7              0               0 TRUE    TRUE    TRUE  
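
If you want to convince yourself that the two approaches agree, a minimal check could look like the sketch below. It only compares which genes are kept, not the extra columns. The names `piped`, `flagged` and `same_genes` are my own, and `dt` is rebuilt from the `dput()` in the question so the block is self-contained.

```r
library(dplyr)

# Example data rebuilt from the dput() in the question
dt <- data.frame(
  Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L),
  Gene = c("AQP11", "CLNS1A", "RSF1", "CFDP1", "CHST6",
           "ACE", "NOS2", "Gene1", "Gene2", "Gene3"),
  Score = c(0.5566507, 0.2811747, 0.5269924, 0.4186066, 0.4295135,
            0.634, 0.6345, 0.7, 0.62, 0.61),
  direct_count = c(4L, 0L, 3L, 1L, 1L, 1L, 1L, 0L, 1L, 0L),
  secondary_count = c(5L, 2L, 6L, 2L, 3L, 1L, 1L, 0L, 0L, 1L)
)

# Pipeline from the question
piped <- dt %>%
  group_by(Group) %>%
  filter((max(Score) - Score) < 0.05) %>%
  slice_max(direct_count, n = 1) %>%
  slice_max(secondary_count, n = 1) %>%
  ungroup()

# Flag-based version, filtered on the last flag
flagged <- dt %>%
  group_by(Group) %>%
  mutate(filter1 = (max(Score) - Score) < 0.05) %>%
  mutate(filter2 = rank(-replace(direct_count, !filter1, -Inf), ties.method = "min") == 1) %>%
  mutate(filter3 = rank(-replace(secondary_count, !filter2, -Inf), ties.method = "min") == 1) %>%
  ungroup() %>%
  filter(filter3)

# TRUE when both approaches keep exactly the same set of genes
same_genes <- setequal(piped$Gene, flagged$Gene)
same_genes
```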

Visual Aid

If it is easier for you to see visually how the filters evolve, remember you can split your data with group_split, like this:

library(dplyr)
dt %>% 
  group_by(Group) %>% 
  mutate(filter1 = (max(Score) - Score)<0.05) %>% 
  mutate(filter2 = rank(-replace(direct_count, !filter1, -Inf), ties.method = "min") == 1) %>% 
  mutate(filter3 = rank(-replace(secondary_count, !filter2, -Inf), ties.method = "min") == 1) %>% 
  group_split()

OUTPUT:

<list_of<
  tbl_df<
    Group          : integer
    Gene           : character
    Score          : double
    direct_count   : integer
    secondary_count: integer
    filter1        : logical
    filter2        : logical
    filter3        : logical
  >
>[4]>
[[1]]
# A tibble: 3 x 8
  Group Gene   Score direct_count secondary_count filter1 filter2 filter3
  <int> <chr>  <dbl>        <int>           <int> <lgl>   <lgl>   <lgl>  
1     1 AQP11  0.557            4               5 TRUE    TRUE    TRUE   
2     1 CLNS1A 0.281            0               2 FALSE   FALSE   FALSE  
3     1 RSF1   0.527            3               6 TRUE    FALSE   FALSE  

[[2]]
# A tibble: 2 x 8
  Group Gene  Score direct_count secondary_count filter1 filter2 filter3
  <int> <chr> <dbl>        <int>           <int> <lgl>   <lgl>   <lgl>  
1     2 CFDP1 0.419            1               2 TRUE    TRUE    FALSE  
2     2 CHST6 0.430            1               3 TRUE    TRUE    TRUE   

[[3]]
# A tibble: 2 x 8
  Group Gene  Score direct_count secondary_count filter1 filter2 filter3
  <int> <chr> <dbl>        <int>           <int> <lgl>   <lgl>   <lgl>  
1     3 ACE   0.634            1               1 TRUE    TRUE    TRUE   
2     3 NOS2  0.634            1               1 TRUE    TRUE    TRUE   

[[4]]
# A tibble: 3 x 8
  Group Gene  Score direct_count secondary_count filter1 filter2 filter3
  <int> <chr> <dbl>        <int>           <int> <lgl>   <lgl>   <lgl>  
1     4 Gene1  0.7             0               0 TRUE    TRUE    TRUE   
2     4 Gene2  0.62            1               0 FALSE   FALSE   FALSE  
3     4 Gene3  0.61            0               1 FALSE   FALSE   FALSE  

But if you are more of a "visual" person, you can plot the evolution of the filters for each group.

Use geom_tile to create a heatmap of the selected rows.

The plot has to be read from left to right. The red tiles are the ones discarded by the filter.

library(ggplot2)
library(tidyr)
library(dplyr)

dt %>% 
  group_by(Group) %>% 
  mutate(filter1 = (max(Score) - Score)<0.05) %>% 
  mutate(filter2 = rank(-replace(direct_count, !filter1, -Inf), ties.method = "min") == 1) %>% 
  mutate(filter3 = rank(-replace(secondary_count, !filter2, -Inf), ties.method = "min") == 1) %>% 
  
  select(Group, Gene, starts_with("filter")) %>% 
  pivot_longer(starts_with("filter")) %>% 
  
  ggplot() +
  geom_tile(aes(x = name, y = Gene, fill = value), colour = "black") +
  facet_wrap("Group", scales = "free") +
  labs(title = "Gene selected from left to right",
       x = "Filters",
       y = "Genes",
       fill = "Selected")



Definitive filter

Below is the code to see how many genes were selected at each step.

Also, in the last column, you can see at which filter each group got down to the number of genes it ends with (0 means no filter removed anything, so every gene in the group was kept). This way you can see how many times each filter was the definitive one.

library(dplyr)
dt1 <- dt %>% 
  group_by(Group) %>% 
  # flag every row so that summing filter0 gives the group size
  mutate(filter0 = TRUE) %>% 
  mutate(filter1 = (max(Score) - Score)<0.05) %>% 
  mutate(filter2 = rank(-replace(direct_count, !filter1, -Inf), ties.method = "min") == 1) %>% 
  mutate(filter3 = rank(-replace(secondary_count, !filter2, -Inf), ties.method = "min") == 1) %>% 

  # sum the number of genes selected for each filter
  group_by(Group) %>% 
  summarise(across(starts_with("filter"), sum)) %>% 
  
  # show the number of the decisive filter!
  rowwise() %>% 
  mutate(definitive = which.min(c_across(starts_with("filter")))-1) %>%
  ungroup()

dt1
#> # A tibble: 4 x 6
#>   Group filter0 filter1 filter2 filter3 definitive
#>   <int>   <int>   <int>   <int>   <int>      <dbl>
#> 1     1       3       2       1       1          2
#> 2     2       2       2       2       1          3
#> 3     3       2       2       2       2          0
#> 4     4       3       1       1       1          1

count(dt1, definitive)
#> # A tibble: 4 x 2
#>   definitive     n
#>        <dbl> <int>
#> 1          0     1
#> 2          1     1
#> 3          2     1
#> 4          3     1

ggplot(dt1) + geom_bar(aes(definitive))
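
To get closer to the named counts asked for in the question, and to pull the matching groups back out of the data, something along these lines might work. Treat it as a sketch under my own assumptions: the labels in `decided_by`, the helper columns `n1` to `n3`, and the choice to report groups that keep several genes as a separate "tie" category are not from the question. A group that only ever contained one gene would be labelled "score" here. `dt` is rebuilt from the `dput()` in the question so the block is self-contained.

```r
library(dplyr)

# Example data rebuilt from the dput() in the question
dt <- data.frame(
  Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L),
  Gene = c("AQP11", "CLNS1A", "RSF1", "CFDP1", "CHST6",
           "ACE", "NOS2", "Gene1", "Gene2", "Gene3"),
  Score = c(0.5566507, 0.2811747, 0.5269924, 0.4186066, 0.4295135,
            0.634, 0.6345, 0.7, 0.62, 0.61),
  direct_count = c(4L, 0L, 3L, 1L, 1L, 1L, 1L, 0L, 1L, 0L),
  secondary_count = c(5L, 2L, 6L, 2L, 3L, 1L, 1L, 0L, 0L, 1L)
)

decided <- dt %>%
  group_by(Group) %>%
  mutate(
    filter1 = (max(Score) - Score) < 0.05,
    filter2 = rank(-replace(direct_count, !filter1, -Inf), ties.method = "min") == 1,
    filter3 = rank(-replace(secondary_count, !filter2, -Inf), ties.method = "min") == 1
  ) %>%
  # number of genes still selected after each step, per group
  summarise(n1 = sum(filter1), n2 = sum(filter2), n3 = sum(filter3)) %>%
  # hypothetical labels: the first step that leaves a single gene decides
  mutate(decided_by = case_when(
    n3 > 1  ~ "tie",
    n1 == 1 ~ "score",
    n2 == 1 ~ "direct_count",
    TRUE    ~ "secondary_count"
  ))

# How many groups were decided by each rule
count(decided, decided_by)

# Pull the rows of the groups decided by secondary_count out of the data
secondary_groups <- semi_join(
  dt,
  filter(decided, decided_by == "secondary_count"),
  by = "Group"
)
secondary_groups
```

Swapping the label in the last `filter()` call pulls out the groups decided by any other rule.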

