
Aggregate a table by several columns in R

I'm trying to aggregate (consolidate) a table that resulted from using rbind to join several data frames. I need to consolidate the rows that have the same values in several columns and sum the count for these rows.

For simplification, I'm displaying a sample of the table.

    count  freq   cdr3nt        cdr3aa    v          d     j       VEnd DStart  DEnd  JStart
 1. 5344   0.160  TGGGTCAACTAA  CASSQRD   TRBV14    TRBD1  TRBJ2-1  18    -1     18     27  
 2. 245    0.022  TGGACTAATCAG  CAQSTRTT  TRBV27-1  TRBD2  TRBJ2-5  12    17     -1     19
 3. 120    0.010  TAGGGAGGC     CASTT     TRBV7-2   TRBD1  TRBJ1-5  10    19     -1     34
 4. 102    0.010  TGGACTAATCAG  CAQSTRTT  TRBV27-1  TRBD2  TRBJ2-5  12    17     -1     19
 5. 52     0.001  TGGGTCAACTAA  CASSQRD   TRBV14    TRBD1  TRBJ2-1  18    -1     18     27
 6. 51     0.001  TGCGGGAA      CGSSST    TRBV4-3   TRBD2  TRBJ1-3  27    10     26     24      

If the columns cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, and JStart all have the same values, the count values for those rows should be summed and only one row displayed with that information. In addition, I need to recalculate the frequency for the consolidated rows by dividing the resulting count by the total sum of the counts for the table. The resulting table should look like this:

 count  freq   cdr3nt        cdr3aa    v          d     j       VEnd DStart  DEnd  JStart
 5396   0.18  TGGGTCAACTAA  CASSQRD   TRBV14    TRBD1  TRBJ2-1  18    -1     18     27  
 347    0.034 TGGACTAATCAG  CAQSTRTT  TRBV27-1  TRBD2  TRBJ2-5  12    17     -1     19
 120    0.010 TAGGGAGGC     CASTT     TRBV7-2   TRBD1  TRBJ1-5  10    19     -1     34
 51     0.001 TGCGGGAA      CGSSST    TRBV4-3   TRBD2  TRBJ1-3  27    10     26     24

Rows 1 and 5 were consolidated, as were rows 2 and 4. Their counts were added and freq was recalculated. In the real-life version, there may be more than two rows with the same values.

I started out with the following piece of code trying to use the aggregate function but ran into trouble right off the bat. I didn't even bother to do the new frequency calculation.

 samrep1 <- read.table("/Data/tables_merge/JB-3_R1.txt", header=TRUE, sep="\t")
 samrep2 <- read.table("/Data/tables_merge/JB-3_R2.txt", header=TRUE, sep="\t")
 samrep3 <- read.table("/Data/tables_merge/JB-3_R3.txt", header=TRUE, sep="\t")
 samrep4 <- read.table("/Data/tables_merge/JB-3_R4.txt", header=TRUE, sep="\t")

 table2 <- rbind(samrep1, samrep2)
 table3  <- rbind(table2, samrep3)
 table4 <- rbind(table3, samrep4)

 agg_table <- aggregate(table4, by=list(table4$cdr3nt), FUN = sum)

Any help will be greatly appreciated.

Instead of creating multiple objects in the global environment, we can read all the files into a single data frame with map_dfr, then group by the key columns and sum the counts:

library(dplyr)
library(purrr)
out <- list.files(path = "/Data/tables_merge", pattern = "^JB-\\d+_R\\d+\\.txt",
           full.names = TRUE) %>%
       map_dfr(read.table, header = TRUE, sep="\t") %>%
        group_by(cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, JStart) %>%
        summarise(Count = sum(count), freq = Count/sum(.$count))
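For readers without purrr, the same read-everything-then-combine idea can be sketched in base R with lapply and do.call(rbind, ...). This is a swapped-in technique, not the answer's own code. The temporary directory and the two tiny two-column files are hypothetical stand-ins for /Data/tables_merge and the real replicate files; only the file-name pattern comes from the question.

```r
# Hypothetical stand-in for /Data/tables_merge: a temp dir with two small files.
tmp_dir <- tempfile("tables_merge_")
dir.create(tmp_dir)

# Made-up two-column replicates (assumption: real files have more columns).
rep1 <- data.frame(count = c(5344, 245),
                   cdr3nt = c("TGGGTCAACTAA", "TGGACTAATCAG"))
rep2 <- data.frame(count = c(52, 102),
                   cdr3nt = c("TGGGTCAACTAA", "TGGACTAATCAG"))
write.table(rep1, file.path(tmp_dir, "JB-3_R1.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(rep2, file.path(tmp_dir, "JB-3_R2.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)

# Read every matching file into a list, then stack the list into one data frame.
files <- list.files(path = tmp_dir, pattern = "^JB-\\d+_R\\d+\\.txt$",
                    full.names = TRUE)
tables <- lapply(files, read.table, header = TRUE, sep = "\t")
combined <- do.call(rbind, tables)

# Sum counts per sequence and recompute freq against the table-wide total.
totals <- aggregate(count ~ cdr3nt, data = combined, FUN = sum)
totals$freq <- totals$count / sum(totals$count)
print(totals)

unlink(tmp_dir, recursive = TRUE)  # clean up the temporary files
```

With the real data, the grouping formula would list all nine key columns instead of cdr3nt alone.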

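This should be possible with the dplyr package: group_by defines the groups and summarize aggregates the values within each group. Here dta is a placeholder for the combined table, and the total is computed before grouping so that freq is divided by the sum of counts over the whole table.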

library(dplyr)

dta %>% mutate(total = sum(count)) %>% 
group_by(cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, JStart) %>%
summarize(count_new = sum(count), freq = count_new/mean(total))

Consider the formula version of aggregate wrapped in within for freq calculation:

final_df <- do.call(rbind, list(samrep1, samrep2, samrep3, samrep4))

agg_df <- within(aggregate(count ~ cdr3nt + cdr3aa + v + d + j + VEnd + DStart + DEnd + JStart, 
                           data = final_df, 
                           FUN = sum),
                 freq <- count / sum(count)
          )
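As a quick check, the formula-based aggregation can be run on the six sample rows from the question. This is a minimal sketch: sample_df, key_cols and agg_sample are illustrative names, and reformulate is used only to avoid typing the long formula by hand. Because the question shows only a sample of a larger table, freq computed from these six rows alone will not match the freq values printed in the question.

```r
# The six sample rows from the question (row labels and the old freq omitted).
sample_df <- data.frame(
  count  = c(5344, 245, 120, 102, 52, 51),
  cdr3nt = c("TGGGTCAACTAA", "TGGACTAATCAG", "TAGGGAGGC",
             "TGGACTAATCAG", "TGGGTCAACTAA", "TGCGGGAA"),
  cdr3aa = c("CASSQRD", "CAQSTRTT", "CASTT", "CAQSTRTT", "CASSQRD", "CGSSST"),
  v      = c("TRBV14", "TRBV27-1", "TRBV7-2", "TRBV27-1", "TRBV14", "TRBV4-3"),
  d      = c("TRBD1", "TRBD2", "TRBD1", "TRBD2", "TRBD1", "TRBD2"),
  j      = c("TRBJ2-1", "TRBJ2-5", "TRBJ1-5", "TRBJ2-5", "TRBJ2-1", "TRBJ1-3"),
  VEnd   = c(18, 12, 10, 12, 18, 27),
  DStart = c(-1, 17, 19, 17, -1, 10),
  DEnd   = c(18, -1, -1, -1, 18, 26),
  JStart = c(27, 19, 34, 19, 27, 24),
  stringsAsFactors = FALSE
)

# Columns that must all match for two rows to be consolidated.
key_cols <- c("cdr3nt", "cdr3aa", "v", "d", "j",
              "VEnd", "DStart", "DEnd", "JStart")

# Builds count ~ cdr3nt + cdr3aa + ... + JStart from the column names.
agg_formula <- reformulate(key_cols, response = "count")

# Sum the counts within each group, then recompute freq from the new total.
agg_sample <- aggregate(agg_formula, data = sample_df, FUN = sum)
agg_sample$freq <- agg_sample$count / sum(agg_sample$count)

# Largest clonotypes first, as in the question's expected table.
agg_sample <- agg_sample[order(-agg_sample$count), ]
print(agg_sample)
```

The consolidated counts come out as 5396, 347, 120 and 51, matching the count column of the expected table in the question.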

