
Create a table (matrix) of counts in R

I'm trying to build a table from a series of lists that I've converted to data frames. Each list consists of character strings and their counts. The strings vary in length from 7 to 20 characters (or more). Each list has a header that identifies the source of the strings. I have 66 lists (sources), each containing over 5,000 strings. Not every string appears in every list, so the number of strings per list varies. Here's an example of the structure of a single list.

$PreAg_18_2

CDR3.aa         Clones
 <chr>            <int>
CASSYGTAYTGELFF   1623
CASSRGDSDNSPLHF   1440
CASSREKAFF        1161
CSGMGALAKNIQYF     949
CSAYTGLSYEQYF      813
CASSLSLAVNSPLHF    634
CAIRDTPGSPQHF      574
CATGQVNTEAFF       555
CASSLKGQGGSPLHF    499
CASSYSRSPQPQHF     478

I want to combine the results in a single table showing the counts (Clones), with all the strings (CDR3.aa) listed on the y-axis and each list's header (Sample.Id) on the x-axis. An example would be:

                  10_pep_10_1   preAg_10_2   Dec_2_18_1  …...
CASSYGTAYTGELFF          1623          234            0
CASSRGDSDNSPLHF          1440          522           28
CASSREKAFF               1161          445           50
CSGMGALAKNIQYF            949           24            0
CASSYSRSPQPQHF            478            0          398
.
.

I'm able to generate a single list containing the data as in the example, and I think converting the lists to data frames is a better way to manipulate them, but I'm having trouble consolidating them against a single list of all the strings and moving the sample.id to the x-axis. My idea is to unlist and join all the strings into a single df, but I'm not sure how to keep the counts matched to the strings. Is there a function in R that will help me do this, or is a loop unavoidable?

So far I've been able to generate a global list of strings, but I now need to match the counts by header (sample.id). Not sure how to approach this.

    library(immunarch)
    library(stringr)
    library(plyr)

    immdata = repLoad("/mnt/data/Development/Analysis_Script/input_files/")

    all <- immdata$data

    # Get list headers (names)
    sample.id <- names(all)

    # make new variable for extraction of clones
    all.c <- all

    # Get list of clones and filter for unique clones per list.
    for (i in 1:length(all.c)){
        all.c[[i]]$Sample.ID<-names(all.c)[i]
        all.c[[i]]<-all.c[[i]][,c("CDR3.aa", "Clones")]
    }


    # bysamp is a list (vector) of the samples and their clones
    bysamp <- split(all.c, sample.id, sep=" ")

    # make vector of all clones
    all.clones <- unlist(all.c, use.names=FALSE)

    # a list of the aggregate of all the clones in all the samples.
    all.clones

    # Removes clone repeats
    all.clones.u <- unique(all.clones)

    # convert list of clones and sample.ids to data frame
    all.clones.u <- data.frame(all.clones.u)
    sample.id <- data.frame(sample.id)

    # Additional code here:

See summary above for expected matrix (table)

Here's a solution based on my best guess as to the structure of your data (it sounds familiar as I'm surrounded by immunologists). The key is to add a variable to each source that will keep track of the source. The sources (list/data.frames) can then be combined into a single data.frame and processed further.

First, set a random number seed for a reproducible example.

  set.seed(1234)

Create a simplified artificial data set. This will consist of 6 sources (list/data.frames). Each data.frame has two variables named aa and clones . Each of the 12 aa values in a data.frame is built from three letters drawn at random from A, B and C, standing in for the CDR3 amino acid sequences. The count of each clone is stored in clones and is drawn at random from the integers 10 to 20. Finally, each of the 6 lists/data.frames is given a name. Instead of "10_pep_10_1" I use source_1, source_2, etc.

Hopefully this has replicated the data you face. By using just 3 possible amino acids, this example ensures that the same sequence has a good chance of occurring a few times in the different lists.

# generate sample data
  spl <- replicate(6, { # the braces '{}' define an expression to be repeated
      n <- 12 # number of aa values in each list
      aa <- replicate(n,
        paste(sample(LETTERS[1:3], 3, replace = TRUE), collapse = ""))
      clones <- sample(10:20, n, replace = TRUE)
      data.frame(aa, clones)}, # this is the 'return' value of the expression
    simplify = FALSE) # this ensures that the result remains as a list

# name each list
  names(spl) <- paste("source", seq_along(spl), sep = "_")

Examine the first of the 6 data.frames.

  head(spl$source_1)
>    aa clones
> 1 ABB     12
> 2 BCB     12
> 3 AAB     20
> 4 BCB     18
> 5 ACA     16
> 6 CAA     17

Add a new variable named source to each list/data.frame that holds the name of the source. This is easily done with a simple for loop. Show the change in the first data.frame.

  for (i in seq_along(spl)) spl[[i]]$source <- names(spl)[i]

  head(spl$source_1) # or head(spl[[1]])
>    aa clones   source
> 1 ABB     12 source_1
> 2 BCB     12 source_1
> 3 AAB     20 source_1
> 4 BCB     18 source_1
> 5 ACA     16 source_1
> 6 CAA     17 source_1
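
The same tagging can also be written without an explicit loop. Here is a minimal sketch, assuming a named list of data.frames like spl ; the two-element list toy below is made up purely for illustration.

```r
# Hypothetical toy list standing in for spl
toy <- list(
  source_1 = data.frame(aa = c("ABB", "BCB"), clones = c(12L, 18L)),
  source_2 = data.frame(aa = c("ABB", "CAA"), clones = c(14L, 11L))
)

# Map() walks the data.frames and their names in parallel;
# cbind() appends the name as a new column called source
tagged <- Map(function(df, nm) cbind(df, source = nm), toy, names(toy))

tagged$source_2
```

Whether this reads better than the for loop is a matter of taste; both leave a list in which every data.frame carries its own source name.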

Now, combine the list/data.frames into a single data.frame, with the variable source keeping track of which list/data.frame contributed each row. Then use base functions to total the counts ( clones ) for each peptide ( aa ) and source . The result, stored in res , is another data.frame, from which a contingency table of counts is generated. These two steps are often combined into one. See the help file for aggregate() for more information. A popular alternative for this sort of data wrangling is the dplyr package.

  dat <- do.call(rbind, spl)

  res <- aggregate(clones ~ aa + source, dat, sum)
  tbl <- xtabs(clones ~ aa + source, res)

# this operation is rather common and often is done in one line:
  tbl <- xtabs(clones ~ ., aggregate(clones ~ ., dat, sum))

  head(tbl, 10)
>      source
> aa    source_1 source_2 source_3 source_4 source_5 source_6
>   AAA       29        0       46        0        0       14
>   AAB       20        0        0        0        0        0
>   ABB       12       14       13        0        0        0
>   ACA       16       23       16        0        0        0
>   ACB       13       19       15        0        0        0
>   BAA       17        0        0       55       16       33
>   BAC       15       19       19        0       34        0
>   BCB       30        0        0       68       38       15
>   CAA       17       11        0        0        0        0
>   CCA       15        0        0        0        0        0
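
As a side note, xtabs() already sums the left-hand side within each cell, so if a sum is the desired summary the aggregate() step can probably be skipped. The sketch below checks this on a small made-up data.frame ( toy_dat is hypothetical, not the data above) and also converts the table to an ordinary data.frame.

```r
# Hypothetical long-format data: one row per aa/source observation
toy_dat <- data.frame(
  aa     = c("ABB", "BCB", "BCB", "ABB", "CAA"),
  clones = c(12, 12, 18, 14, 11),
  source = c("source_1", "source_1", "source_1", "source_2", "source_2")
)

# Two-step version: aggregate first, then cross-tabulate
two_step <- xtabs(clones ~ aa + source,
                  aggregate(clones ~ aa + source, toy_dat, sum))

# One-step version: xtabs() adds up clones for every aa/source pair
one_step <- xtabs(clones ~ aa + source, toy_dat)

# Compare the two tables cell by cell
all(two_step == one_step)

# Plain data.frame with one row per aa and one column per source
wide <- as.data.frame.matrix(one_step)
wide
```

Combinations that never occur (here, BCB in source_2) come out as 0, which matches the layout asked for in the question.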

The order of rows in the table is simply the order inherited during rbind . One can change that by reorganizing the table. Here, the rows are sorted alphabetically.

  ord <- order(rownames(tbl))
  head(tbl[ord,], 10)
>      source
> aa    source_1 source_2 source_3 source_4 source_5 source_6
>   AAA       29        0       46        0        0       14
>   AAB       20        0        0        0        0        0
>   AAC        0       19       19        0        0       31
>   ABA        0       11        0        0       15       18
>   ABB       12       14       13        0        0        0
>   ACA       16       23       16        0        0        0
>   ACB       13       19       15        0        0        0
>   ACC        0       11       16        0       15        0
>   BAA       17        0        0       55       16       33
>   BAB        0       15        0        0        0        0
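
Applied to the column names from the question ( CDR3.aa and Clones ), the whole pipeline might look like the sketch below. The list toy_data is a hand-built stand-in for immdata$data , filled with the counts from the question's example table; zero entries are left out to mimic sequences that are missing from a sample. The column name Sample.ID and the final sort by total count are my own choices, not something the question requires.

```r
# Hypothetical stand-in for immdata$data, using the question's example counts
toy_data <- list(
  "10_pep_10_1" = data.frame(
    CDR3.aa = c("CASSYGTAYTGELFF", "CASSRGDSDNSPLHF", "CASSREKAFF",
                "CSGMGALAKNIQYF", "CASSYSRSPQPQHF"),
    Clones  = c(1623L, 1440L, 1161L, 949L, 478L)),
  "preAg_10_2" = data.frame(
    CDR3.aa = c("CASSYGTAYTGELFF", "CASSRGDSDNSPLHF", "CASSREKAFF",
                "CSGMGALAKNIQYF"),
    Clones  = c(234L, 522L, 445L, 24L)),
  "Dec_2_18_1" = data.frame(
    CDR3.aa = c("CASSRGDSDNSPLHF", "CASSREKAFF", "CASSYSRSPQPQHF"),
    Clones  = c(28L, 50L, 398L))
)

# Keep only the two needed columns, then tag each row with its sample name
for (i in seq_along(toy_data)) {
  toy_data[[i]] <- toy_data[[i]][, c("CDR3.aa", "Clones")]
  toy_data[[i]]$Sample.ID <- names(toy_data)[i]
}

# Stack all samples into one long data.frame
long <- do.call(rbind, toy_data)

# Make the table columns follow the order of the list
long$Sample.ID <- factor(long$Sample.ID, levels = names(toy_data))

# Sequences as rows, samples as columns, missing combinations as 0
counts <- xtabs(Clones ~ CDR3.aa + Sample.ID, long)

# Optional: put the sequences with the largest total count first
counts <- counts[order(rowSums(counts), decreasing = TRUE), ]
counts
```

With the real data, immdata$data would presumably take the place of toy_data , provided each element is a data.frame with CDR3.aa and Clones columns as shown in the question.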
