I'm trying to build a table from a series of lists that I've converted to data frames. Each list consists of character strings and their counts. The strings vary in length from 7 to 20 (or more) characters. Each list has a header that identifies the source of the strings. I have 66 lists (sources), and each contains over 5,000 strings. Not every string appears in every list, so the number of strings varies from list to list. Here's an example of the structure of a single list.
$PreAg_18_2
CDR3.aa Clones
<chr> <int>
CASSYGTAYTGELFF 1623
CASSRGDSDNSPLHF 1440
CASSREKAFF 1161
CSGMGALAKNIQYF 949
CSAYTGLSYEQYF 813
CASSLSLAVNSPLHF 634
CAIRDTPGSPQHF 574
CATGQVNTEAFF 555
CASSLKGQGGSPLHF 499
CASSYSRSPQPQHF 478
I want to combine the results into a single table showing the counts (Clones), with all the strings (CDR3.aa) listed on the y-axis and each list's header (Sample.Id) on the x-axis. An example would be:
10_pep_10_1 preAg_10_2 Dec_2_18_1 …...
CASSYGTAYTGELFF 1623 234 0
CASSRGDSDNSPLHF 1440 522 28
CASSREKAFF 1161 445 50
CSGMGALAKNIQYF 949 24 0
CASSYSRSPQPQHF 478 0 398
.
.
I'm able to generate a single individual list as in the example, and I think converting the lists to data frames is a better way to manipulate them, but I'm having trouble consolidating them against a single list of all the strings and moving the sample.id to the x-axis. My idea is to unlist and join all the strings into a single df, but I'm not sure how to keep the counts matched to the strings. Is there a function in R that will help me do this? Or is writing a loop unavoidable?
So far I've been able to generate a global list of strings, but I now need to match the counts by header (sample.id). Not sure how to approach this.
library(immunarch)
library(stringr)
library(plyr)
immdata = repLoad("/mnt/data/Development/Analysis_Script/input_files/")
all <- immdata$data
# Get list headers (names)
sample.id <- names(all)
# make new variable for extraction of clones
all.c <- all
# Get list of clones and filter for unique clones per list.
for (i in 1:length(all.c)){
all.c[[i]]$Sample.ID<-names(all.c)[i]
all.c[[i]]<-all.c[[i]][,c("CDR3.aa", "Clones")]
}
# bysamp is a list (vector) of the samples and their clones
bysamp <- split(all.c, sample.id, sep=" ")
# make vector of all clones
all.clones <- unlist(all.c, use.names=FALSE)
# a list of the aggregate of all the clones in all the samples.
all.clones
# Removes clone repeats
all.clones.u <- unique(all.clones)
# convert list of clones and sample.ids to data frame
all.clones.u <- data.frame(all.clones.u)
sample.id <- data.frame(sample.id)
# Additional code here:
See summary above for expected matrix (table)
Here's a solution based on my best guess as to the structure of your data (it sounds familiar as I'm surrounded by immunologists). The key is to add a variable to each source that will keep track of the source. The sources (list/data.frames) can then be combined into a single data.frame and processed further.
First, set a random number seed for a reproducible example.
set.seed(1234)
Create a simplified artificial data set. This will consist of 6 sources (list/data.frames). Each data.frame has two variables named aa and clones. Each of the 12 aa values in a data.frame is built from three letters drawn at random from A, B and C, which stand in for the CDR3 amino acids. The count of each clone is stored in clones and is set to a random integer between 10 and 20 (inclusive). Finally, each of the 6 lists/data.frames is given a name. Instead of "10_pep_10_1" I use source_1, source_2, etc.
Hopefully this has replicated the data you face. By using just 3 possible amino acids, this example ensures that the same sequence has a good chance of occurring a few times in the different lists.
# generate sample data
spl <- replicate(6, {              # the braces '{}' define an expression to be repeated
  n <- 12                          # number of aa values in each list
  aa <- replicate(n,
    paste(sample(LETTERS[1:3], 3, replace = TRUE), collapse = ""))
  clones <- sample(10:20, n, replace = TRUE)
  data.frame(aa, clones)           # this is the 'return' value of the expression
}, simplify = FALSE)               # this ensures that the result remains as a list
# name each list
names(spl) <- paste("source", seq_along(spl), sep = "_")
Examine the first of the 6 data.frames.
head(spl$source_1)
> aa clones
> 1 ABB 12
> 2 BCB 12
> 3 AAB 20
> 4 BCB 18
> 5 ACA 16
> 6 CAA 17
Add a new variable named source to each list/data.frame that holds the name of the source. This is easily done with a simple for loop. Show the change in the first data.frame.
for (i in seq_along(spl)) spl[[i]]$source <- names(spl)[i]
head(spl$source_1) # or head(spl[[1]])
> aa clones source
> 1 ABB 12 source_1
> 2 BCB 12 source_1
> 3 AAB 20 source_1
> 4 BCB 18 source_1
> 5 ACA 16 source_1
> 6 CAA 17 source_1
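If you would rather avoid the explicit loop, the same tagging step can probably be written with Map(), which walks the data.frames and their names in parallel. This is only a sketch on a made-up two-source list (spl_demo is a hypothetical stand-in for spl, not your data):

```r
# Hypothetical miniature of the list built above: two sources, two columns each
spl_demo <- list(
  source_1 = data.frame(aa = c("ABB", "BCB"), clones = c(12L, 18L)),
  source_2 = data.frame(aa = c("ABB", "CAA"), clones = c(14L, 11L))
)

# Map() pairs each data.frame with its name; the function adds the name
# as a constant column and returns the modified data.frame
tagged <- Map(function(df, nm) {
  df$source <- nm
  df
}, spl_demo, names(spl_demo))

tagged$source_2
```

The result is still a named list, so everything that follows works on it unchanged.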
Now, combine the list/data.frames into a single data.frame, with the variable source keeping track of which list/data.frame contributed the values. Then use base functions to total the count (clones) for each combination of peptide (aa) and source. The result, stored in res, is another data.frame, from which a contingency table of counts is generated. These two steps are often combined into one. See the help file for aggregate() for more information. A popular alternative for this sort of data wrangling is the dplyr package.
dat <- do.call(rbind, spl)
res <- aggregate(clones ~ aa + source, dat, sum)
tbl <- xtabs(clones ~ aa + source, res)
# this operation is rather common and often is done in one line:
tbl <- xtabs(clones ~ ., aggregate(clones ~ ., dat, sum))
head(tbl, 10)
> source
> aa source_1 source_2 source_3 source_4 source_5 source_6
> AAA 29 0 46 0 0 14
> AAB 20 0 0 0 0 0
> ABB 12 14 13 0 0 0
> ACA 16 23 16 0 0 0
> ACB 13 19 15 0 0 0
> BAA 17 0 0 55 16 33
> BAC 15 19 19 0 34 0
> BCB 30 0 0 68 38 15
> CAA 17 11 0 0 0 0
> CCA 15 0 0 0 0 0
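If you need the result as an ordinary data.frame (for example to write it out with write.csv()), one option is as.data.frame.matrix(). The sketch below uses a small made-up data set (dat_demo and its values are hypothetical) and also relies on the fact that xtabs() sums the left-hand side itself, so repeated aa values within a source are added together:

```r
# Hypothetical long-format data: BCB occurs twice in source_1
dat_demo <- data.frame(
  aa     = c("ABB", "BCB", "BCB", "ABB", "CAA"),
  clones = c(12L, 12L, 18L, 14L, 11L),
  source = c("source_1", "source_1", "source_1", "source_2", "source_2")
)

# Cross-tabulate: one row per aa, one column per source, clones summed
tbl_demo <- xtabs(clones ~ aa + source, dat_demo)

# Convert the two-way table to a plain data.frame, keeping the wide layout
wide <- as.data.frame.matrix(tbl_demo)
wide
```

Note that plain as.data.frame() on a table returns the long format (one row per aa/source pair), which is usually not what you want here.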
The order of entries in the table is simply the order inherited during rbind. One can change that by reorganizing the table. Here, the rows are sorted.
ord <- order(rownames(tbl))
head(tbl[ord,], 10)
> source
> aa source_1 source_2 source_3 source_4 source_5 source_6
> AAA 29 0 46 0 0 14
> AAB 20 0 0 0 0 0
> AAC 0 19 19 0 0 31
> ABA 0 11 0 0 15 18
> ABB 12 14 13 0 0 0
> ACA 16 23 16 0 0 0
> ACB 13 19 15 0 0 0
> ACC 0 11 16 0 15 0
> BAA 17 0 0 55 16 33
> BAB 0 15 0 0 0 0
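Applied to the column names in your question, the whole thing might look roughly like this. I don't have your files, so all_demo below is a made-up stand-in for immdata$data (a named list of data.frames with CDR3.aa and Clones columns), and the counts are for illustration only:

```r
# Hypothetical stand-in for immdata$data: a named list of data.frames
all_demo <- list(
  PreAg_18_2 = data.frame(CDR3.aa = c("CASSYGTAYTGELFF", "CASSREKAFF"),
                          Clones  = c(1623L, 1161L)),
  Dec_2_18_1 = data.frame(CDR3.aa = c("CASSREKAFF", "CASSYSRSPQPQHF"),
                          Clones  = c(50L, 398L))
)

# Keep the two columns of interest, tag each row with its sample name,
# then stack all samples into one long data.frame
long <- do.call(rbind, Map(function(df, id) {
  out <- as.data.frame(df[, c("CDR3.aa", "Clones")])
  out$Sample.ID <- id
  out
}, all_demo, names(all_demo)))

# Strings on the rows, samples on the columns, 0 where a string is absent
counts <- xtabs(Clones ~ CDR3.aa + Sample.ID, long)

# Optional: put the most abundant strings (summed over samples) first
counts <- counts[order(rowSums(counts), decreasing = TRUE), ]
counts
```

With 66 samples of 5,000+ strings each this should still be fast, since there is no per-string loop, only one pass to tag and stack the samples.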