Aggregate a table by several columns in R

I'm trying to aggregate (consolidate) a table that resulted from using rbind to join several data frames. I need to consolidate the rows that have the same values in several columns and sum the count for these rows.

For simplicity, I'm showing only a sample of the table.

    count  freq   cdr3nt        cdr3aa    v          d     j       VEnd DStart  DEnd  JStart
 1. 5344   0.160  TGGGTCAACTAA  CASSQRD   TRBV14    TRBD1  TRBJ2-1  18    -1     18     27  
 2. 245    0.022  TGGACTAATCAG  CAQSTRTT  TRBV27-1  TRBD2  TRBJ2-5  12    17     -1     19
 3. 120    0.010  TAGGGAGGC     CASTT     TRBV7-2   TRBD1  TRBJ1-5  10    19     -1     34
 4. 102    0.010  TGGACTAATCAG  CAQSTRTT  TRBV27-1  TRBD2  TRBJ2-5  12    17     -1     19
 5. 52     0.001  TGGGTCAACTAA  CASSQRD   TRBV14    TRBD1  TRBJ2-1  18    -1     18     27
 6. 51     0.001  TGCGGGAA      CGSSST    TRBV4-3   TRBD2  TRBJ1-3  27    10     26     24      

If the columns cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, and JStart have the same values in several rows, the count values for those rows should be summed, and only one row should be displayed with the information. In addition, I need to recalculate the frequency for each consolidated row by dividing its resulting count by the total sum of the counts for the table. The resulting table should look like this:

 count  freq   cdr3nt        cdr3aa    v          d     j       VEnd DStart  DEnd  JStart
 5396   0.18  TGGGTCAACTAA  CASSQRD   TRBV14    TRBD1  TRBJ2-1  18    -1     18     27  
 347    0.034 TGGACTAATCAG  CAQSTRTT  TRBV27-1  TRBD2  TRBJ2-5  12    17     -1     19
 120    0.010 TAGGGAGGC     CASTT     TRBV7-2   TRBD1  TRBJ1-5  10    19     -1     34
 51     0.001 TGCGGGAA      CGSSST    TRBV4-3   TRBD2  TRBJ1-3  27    10     26     24

Rows 1. and 5. were consolidated, as were rows 2. and 4. Their counts were added and freq was recalculated. In the real-life version, there may be more than two rows with the same values.

I started out with the following piece of code, trying to use the aggregate function, but ran into trouble right off the bat. I didn't even get to the new frequency calculation.

 samrep1 <- read.table("/Data/tables_merge/JB-3_R1.txt", header=TRUE, sep="\t")
 samrep2 <- read.table("/Data/tables_merge/JB-3_R2.txt", header=TRUE, sep="\t")
 samrep3 <- read.table("/Data/tables_merge/JB-3_R3.txt", header=TRUE, sep="\t")
 samrep4 <- read.table("/Data/tables_merge/JB-3_R4.txt", header=TRUE, sep="\t")

 table2 <- rbind(samrep1, samrep2)
 table3  <- rbind(table2, samrep3)
 table4 <- rbind(table3, samrep4)

 agg_table <- aggregate(table4, by=list(table4$cdr3nt), FUN = sum)

Any help will be greatly appreciated.

Instead of creating multiple objects in the global environment, we can read the files into a list, bind them into a single data frame, and then group and summarise:

library(dplyr)
library(purrr)
out <- list.files(path = "/Data/tables_merge", pattern = "^JB-\\d+_R\\d+\\.txt",
           full.names = TRUE) %>%
       map_dfr(read.table, header = TRUE, sep="\t") %>%
        group_by(cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, JStart) %>%
        summarise(Count = sum(count), freq = Count/sum(.$count))
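
To make the grouping step concrete, here is a minimal sketch that runs on the six sample rows from the question instead of the real files. The names sample_df, total_count and merged are made up for illustration, and the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Sample rows copied from the question (stand-in for the real input files)
sample_df <- data.frame(
  count  = c(5344, 245, 120, 102, 52, 51),
  freq   = c(0.160, 0.022, 0.010, 0.010, 0.001, 0.001),
  cdr3nt = c("TGGGTCAACTAA", "TGGACTAATCAG", "TAGGGAGGC",
             "TGGACTAATCAG", "TGGGTCAACTAA", "TGCGGGAA"),
  cdr3aa = c("CASSQRD", "CAQSTRTT", "CASTT", "CAQSTRTT", "CASSQRD", "CGSSST"),
  v      = c("TRBV14", "TRBV27-1", "TRBV7-2", "TRBV27-1", "TRBV14", "TRBV4-3"),
  d      = c("TRBD1", "TRBD2", "TRBD1", "TRBD2", "TRBD1", "TRBD2"),
  j      = c("TRBJ2-1", "TRBJ2-5", "TRBJ1-5", "TRBJ2-5", "TRBJ2-1", "TRBJ1-3"),
  VEnd   = c(18, 12, 10, 12, 18, 27),
  DStart = c(-1, 17, 19, 17, -1, 10),
  DEnd   = c(18, -1, -1, -1, 18, 26),
  JStart = c(27, 19, 34, 19, 27, 24)
)

# Total over the whole table, computed before grouping
total_count <- sum(sample_df$count)

merged <- sample_df %>%
  group_by(cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, JStart) %>%
  summarise(count = sum(count), .groups = "drop") %>%  # one row per unique key
  mutate(freq = count / total_count) %>%               # recompute frequency
  arrange(desc(count))

print(merged)
```

Computing the total before grouping keeps the denominator independent of the groups, so the recalculated freq values sum to 1 over the merged table.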

This should be possible with the dplyr package, using group_by and summarize, where summarize aggregates the values within each group (dta below presumably stands for the combined data frame):

library(dplyr)

dta %>% mutate(total = sum(count)) %>% 
group_by(cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, JStart) %>%
summarize(count_new = sum(count), freq = count_new/mean(total))

Consider the formula version of aggregate, wrapped in within for the freq calculation:

final_df <- do.call(rbind, list(samrep1, samrep2, samrep3, samrep4))

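# Sum count within each unique combination of the nine key columns,
# then divide by the overall total to get the new frequency
agg_df <- within(aggregate(count ~ cdr3nt + cdr3aa + v + d + j + VEnd + DStart + DEnd + JStart, 
                           data = final_df, 
                           FUN = sum),
                 freq <- count / sum(count)
          )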
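
As a small check of what the FUN argument does in the formula interface of aggregate, the sketch below runs it on a toy subset of the question's sample with only two key columns. The names toy, summed and n_rows are made up for illustration. With FUN = sum the count values in each group are added; with FUN = length you only get the number of rows per group.

```r
# Toy subset of the question's sample: four rows, two key columns
toy <- data.frame(
  count  = c(5344, 245, 102, 52),
  cdr3nt = c("TGGGTCAACTAA", "TGGACTAATCAG", "TGGACTAATCAG", "TGGGTCAACTAA"),
  v      = c("TRBV14", "TRBV27-1", "TRBV27-1", "TRBV14")
)

# Adds up count within each (cdr3nt, v) combination
summed <- aggregate(count ~ cdr3nt + v, data = toy, FUN = sum)

# Only counts how many rows fall into each combination
n_rows <- aggregate(count ~ cdr3nt + v, data = toy, FUN = length)

# Frequency relative to the total of the aggregated counts
summed <- within(summed, freq <- count / sum(count))

print(summed)
print(n_rows)
```

Because every input row lands in exactly one group, the total of the aggregated counts equals the total of the original table, so dividing by sum(count) after aggregating gives the frequency the question asks for.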

