I'm trying to aggregate (consolidate) a table that resulted from using rbind to join several data frames. I need to consolidate the rows that have the same values in several columns and sum the count for these rows.
For simplification, I'm displaying a sample of the table.
    count  freq   cdr3nt        cdr3aa    v         d      j        VEnd  DStart  DEnd  JStart
1.  5344   0.160  TGGGTCAACTAA  CASSQRD   TRBV14    TRBD1  TRBJ2-1  18    -1      18    27
2.  245    0.022  TGGACTAATCAG  CAQSTRTT  TRBV27-1  TRBD2  TRBJ2-5  12    17      -1    19
3.  120    0.010  TAGGGAGGC     CASTT     TRBV7-2   TRBD1  TRBJ1-5  10    19      -1    34
4.  102    0.010  TGGACTAATCAG  CAQSTRTT  TRBV27-1  TRBD2  TRBJ2-5  12    17      -1    19
5.  52     0.001  TGGGTCAACTAA  CASSQRD   TRBV14    TRBD1  TRBJ2-1  18    -1      18    27
6.  51     0.001  TGCGGGAA      CGSSST    TRBV4-3   TRBD2  TRBJ1-3  27    10      26    24
If rows have the same values in the columns cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, and JStart, the count values for those rows should be summed and only one row displayed. In addition, I need to recalculate the frequency for each consolidated row by dividing its summed count by the total of all counts in the table. The resulting table should look like this:
count  freq   cdr3nt        cdr3aa    v         d      j        VEnd  DStart  DEnd  JStart
5396   0.18   TGGGTCAACTAA  CASSQRD   TRBV14    TRBD1  TRBJ2-1  18    -1      18    27
347    0.034  TGGACTAATCAG  CAQSTRTT  TRBV27-1  TRBD2  TRBJ2-5  12    17      -1    19
120    0.010  TAGGGAGGC     CASTT     TRBV7-2   TRBD1  TRBJ1-5  10    19      -1    34
51     0.001  TGCGGGAA      CGSSST    TRBV4-3   TRBD2  TRBJ1-3  27    10      26    24
Rows 1 and 5 were consolidated, as were rows 2 and 4. Their counts were added and freq was recalculated. In the real data, more than two rows may share the same values.
I started out with the following piece of code trying to use the aggregate function but ran into trouble right off the bat. I didn't even bother to do the new frequency calculation.
samrep1 <- read.table("/Data/tables_merge/JB-3_R1.txt", header=TRUE, sep="\t")
samrep2 <- read.table("/Data/tables_merge/JB-3_R2.txt", header=TRUE, sep="\t")
samrep3 <- read.table("/Data/tables_merge/JB-3_R3.txt", header=TRUE, sep="\t")
samrep4 <- read.table("/Data/tables_merge/JB-3_R4.txt", header=TRUE, sep="\t")
table2 <- rbind(samrep1, samrep2)
table3 <- rbind(table2, samrep3)
table4 <- rbind(table3, samrep4)
agg_table <- aggregate(table4, by=list(table4$cdr3nt), FUN = sum)
Any help will be greatly appreciated.
Instead of creating multiple objects in the global environment, we can read all the files in one step, bind them into a single data frame, and then group and summarise:
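library(dplyr)
library(purrr)
# List every replicate file, read each one, and row-bind the results
out <- list.files(path = "/Data/tables_merge", pattern = "^JB-\\d+_R\\d+\\.txt",
                  full.names = TRUE) %>%
  map_dfr(read.table, header = TRUE, sep = "\t") %>%
  # One group per unique combination of the nine key columns
  group_by(cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, JStart) %>%
  # `.` is the whole piped data frame, so sum(.$count) is the grand total
  summarise(Count = sum(count), freq = Count / sum(.$count))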
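For readers who prefer to avoid extra packages, the same read-into-a-list idea can be sketched in base R. The example below is only an illustration: it writes two tiny tab-separated files to a temporary directory (the directory name demo_dir and the file contents are made up for the demo), then reads every matching file into a list and stacks the rows with do.call(rbind, ...).

```r
# Illustrative only: create two small tab-separated files in a temp directory.
# The file names mimic the question's naming scheme; the contents are made up.
demo_dir <- file.path(tempdir(), "tables_merge_demo")
dir.create(demo_dir, showWarnings = FALSE)

rep1 <- data.frame(count = c(10L, 5L), cdr3nt = c("TGGGTC", "TAGGGA"),
                   stringsAsFactors = FALSE)
rep2 <- data.frame(count = c(3L, 2L), cdr3nt = c("TGGGTC", "TGCGGG"),
                   stringsAsFactors = FALSE)
write.table(rep1, file.path(demo_dir, "JB-3_R1.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)
write.table(rep2, file.path(demo_dir, "JB-3_R2.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)

# Find all replicate files, read each into a list element, then stack them.
files <- list.files(demo_dir, pattern = "^JB-\\d+_R\\d+\\.txt$",
                    full.names = TRUE)
tables <- lapply(files, read.table, header = TRUE, sep = "\t",
                 stringsAsFactors = FALSE)
combined <- do.call(rbind, tables)
```

Here combined plays the role of table4 in the question, without any intermediate samrep objects being created.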
This should be possible with the dplyr package, using the functions group_by and summarize, where summarize aggregates the values within each group.
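library(dplyr)
# dta is the combined (row-bound) data frame
dta %>%
  # Store the grand total of count on every row before grouping
  mutate(total = sum(count)) %>%
  group_by(cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, JStart) %>%
  # total is constant, so mean(total) simply recovers the grand total
  summarize(count_new = sum(count), freq = count_new / mean(total))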
Consider the formula version of aggregate, wrapped in within for the freq calculation:
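# Stack the four replicate tables into one data frame
final_df <- do.call(rbind, list(samrep1, samrep2, samrep3, samrep4))
# Sum count within each unique combination of the key columns
# (FUN = sum, not length, which would only count the rows),
# then recompute freq against the total count
agg_df <- within(
  aggregate(count ~ cdr3nt + cdr3aa + v + d + j + VEnd + DStart + DEnd + JStart,
            data = final_df,
            FUN = sum),
  freq <- count / sum(count)
)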
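As a quick sanity check, the aggregation can be tried on the six sample rows from the question. This is a minimal sketch, assuming sum as the aggregation function so that counts are added rather than rows counted. The names sample_df and agg_sample are made up here, and freq is recomputed against the total of the six sample counts only, so it will not match the freq values shown in the question, which came from the full table.

```r
# The six sample rows from the question (freq omitted; it is recomputed below).
sample_df <- data.frame(
  count  = c(5344, 245, 120, 102, 52, 51),
  cdr3nt = c("TGGGTCAACTAA", "TGGACTAATCAG", "TAGGGAGGC",
             "TGGACTAATCAG", "TGGGTCAACTAA", "TGCGGGAA"),
  cdr3aa = c("CASSQRD", "CAQSTRTT", "CASTT", "CAQSTRTT", "CASSQRD", "CGSSST"),
  v      = c("TRBV14", "TRBV27-1", "TRBV7-2", "TRBV27-1", "TRBV14", "TRBV4-3"),
  d      = c("TRBD1", "TRBD2", "TRBD1", "TRBD2", "TRBD1", "TRBD2"),
  j      = c("TRBJ2-1", "TRBJ2-5", "TRBJ1-5", "TRBJ2-5", "TRBJ2-1", "TRBJ1-3"),
  VEnd   = c(18, 12, 10, 12, 18, 27),
  DStart = c(-1, 17, 19, 17, -1, 10),
  DEnd   = c(18, -1, -1, -1, 18, 26),
  JStart = c(27, 19, 34, 19, 27, 24),
  stringsAsFactors = FALSE
)

# Sum count within each unique combination of the nine key columns.
agg_sample <- aggregate(
  count ~ cdr3nt + cdr3aa + v + d + j + VEnd + DStart + DEnd + JStart,
  data = sample_df, FUN = sum
)

# Recompute freq against the total count, then sort by count, largest first.
agg_sample$freq <- agg_sample$count / sum(agg_sample$count)
agg_sample <- agg_sample[order(-agg_sample$count), ]
```

The consolidated counts come out as 5396, 347, 120, and 51, matching the count column of the expected table in the question.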