Create a table (matrix) of counts in R

I'm trying to build a table from a series of lists that I've converted to data frames. Each list consists of character strings and their counts. The strings vary in length from 7 to 20 characters (or more). Each list has a header that identifies the source of the strings. I have 66 lists (sources), and each contains over 5,000 strings. Not every string appears in every list, so the number of strings varies from list to list. Here's an example of the structure of a single list.

$PreAg_18_2

CDR3.aa         Clones
 <chr>            <int>
CASSYGTAYTGELFF   1623
CASSRGDSDNSPLHF   1440
CASSREKAFF        1161
CSGMGALAKNIQYF     949
CSAYTGLSYEQYF      813
CASSLSLAVNSPLHF    634
CAIRDTPGSPQHF      574
CATGQVNTEAFF       555
CASSLKGQGGSPLHF    499
CASSYSRSPQPQHF     478

I want to combine the results into a single table of counts (Clones), with all the strings (CDR3.aa) as rows (the y-axis) and each list's header (Sample.Id) as columns (the x-axis). An example would be:

            10_pep_10_1     preAg_10_2      Dec_2_18_1  …... 
CASSYGTAYTGELFF    1623         234             0
CASSRGDSDNSPLHF    1440         522             28
CASSREKAFF         1161         445             50  
CSGMGALAKNIQYF      949         24              0
CASSYSRSPQPQHF      478         0               398
.
.

I'm able to generate a single list like the one in the example, and I think converting the lists to data frames is a better way to manipulate them. However, I'm having trouble consolidating them against a single list of all the strings and moving the sample.id to the x-axis. My idea is to unlist and join all the strings into a single data frame, but I'm not sure how to keep the counts matched to the strings. Is there a function in R that will help me do this, or is writing a loop unavoidable?

So far I've been able to generate a global list of strings, but I now need to match the counts by header (sample.id), and I'm not sure how to approach this.

    library(immunarch)
    library(stringr)
    library(plyr)

    immdata = repLoad("/mnt/data/Development/Analysis_Script/input_files/")

    all <- immdata$data

    # Get list headers (names)
    sample.id <- names(all)

    # make new variable for extraction of clones
    all.c <- all

    # Tag each data frame with its sample name and keep only the columns needed.
    for (i in seq_along(all.c)){
        all.c[[i]]$Sample.ID <- names(all.c)[i]
        all.c[[i]] <- all.c[[i]][, c("CDR3.aa", "Clones", "Sample.ID")]
    }


    # bysamp is a list of the samples and their clones, grouped by sample name
    bysamp <- split(all.c, sample.id, sep=" ")

    # make vector of all clone sequences (CDR3.aa only, not the counts)
    all.clones <- unlist(lapply(all.c, function(d) d$CDR3.aa), use.names=FALSE)

    # a list of the aggregate of all the clones in all the samples.
    all.clones

    # Removes clone repeats
    all.clones.u <- unique(all.clones)

    # convert list of clones and sample.ids to data frame
    all.clones.u <- data.frame(all.clones.u)
    sample.id <- data.frame(sample.id)

    # Additional code here:

See the summary above for the expected matrix (table).

Here's a solution based on my best guess as to the structure of your data (it sounds familiar, as I'm surrounded by immunologists). The key is to add a variable to each source that keeps track of the source. The sources (list/data.frames) can then be combined into a single data.frame and processed further.

First, set a random number seed for a reproducible example.

  set.seed(1234)

Create a simplified artificial data set. This will consist of 6 sources (list/data.frames). Each data.frame has two variables named aa and clones. Each of the 12 aa values in a data.frame is built from three letters drawn at random from A, B and C, standing in for the CDR3 amino acids. The count of each clone is stored in clones and is set to a random integer between 10 and 20. Finally, each of the 6 lists/data.frames is given a name. Instead of "10_pep_10_1" I use source_1, source_2, etc.

Hopefully this replicates the data you face. By using just 3 possible amino acids, this example ensures that the same sequence has a good chance of occurring several times across the different lists.

# generate sample data
  spl <- replicate(6, { # the braces '{}' define an expression to be repeated
      n <- 12 # number of aa values in each list
      aa <- replicate(n,
        paste(sample(LETTERS[1:3], 3, replace = TRUE), collapse = ""))
      clones <- sample(10:20, n, replace = TRUE)
      data.frame(aa, clones)}, # this is the 'return' value of the expression
    simplify = FALSE) # this ensures that the result remains as a list

# name each list
  names(spl) <- paste("source", seq_along(spl), sep = "_")

Examine the first of the 6 data.frames.

  head(spl$source_1)
>    aa clones
> 1 ABB     12
> 2 BCB     12
> 3 AAB     20
> 4 BCB     18
> 5 ACA     16
> 6 CAA     17

Add a new variable named source to each list/data.frame to hold the name of the source. This is easily done with a simple for loop. Then show the change in the first data.frame.

  for (i in seq_along(spl)) spl[[i]]$source <- names(spl)[i]

  head(spl$source_1) # or head(spl[[1]])
>    aa clones   source
> 1 ABB     12 source_1
> 2 BCB     12 source_1
> 3 AAB     20 source_1
> 4 BCB     18 source_1
> 5 ACA     16 source_1
> 6 CAA     17 source_1
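
The same tagging can also be done without an explicit loop. Here is a minimal sketch, assuming a small hand-made list named toy (hypothetical, standing in for spl): Map() walks over the data frames and their names in parallel, and the tagged pieces are then stacked into one long data frame.

```r
# Hypothetical stand-in for spl: a named list of two small data frames.
toy <- list(
  source_1 = data.frame(aa = c("ABB", "BCB"), clones = c(12L, 18L)),
  source_2 = data.frame(aa = c("BCB", "CAA"), clones = c(11L, 17L))
)

# Map() pairs each data frame with its name; cbind() adds that name
# as a constant column called `source`.
tagged <- Map(function(d, nm) cbind(d, source = nm), toy, names(toy))

# Stack the tagged data frames into a single long data frame.
long <- do.call(rbind, tagged)
rownames(long) <- NULL  # drop the "source_1.1"-style row names from rbind
long
```

Whether this reads better than the for loop is a matter of taste; both leave every row labelled with the list it came from.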

Now, combine all of the list/data.frames into a single data.frame, with the variable source keeping track of which list/data.frame contributed the values. Then use base functions to total the count (clones) for each combination of peptide (aa) and source. The result, stored in res, is another data.frame, from which a contingency table of counts is generated. Often the two steps are combined into one. See the help file for aggregate() for more information. A popular alternative for this sort of data wrangling is the dplyr package.

  dat <- do.call(rbind, spl)

  res <- aggregate(clones ~ aa + source, dat, sum)
  tbl <- xtabs(clones ~ aa + source, res)

# this operation is rather common and often is done in one line:
  tbl <- xtabs(clones ~ ., aggregate(clones ~ ., dat, sum))

  head(tbl, 10)
>      source
> aa    source_1 source_2 source_3 source_4 source_5 source_6
>   AAA       29        0       46        0        0       14
>   AAB       20        0        0        0        0        0
>   ABB       12       14       13        0        0        0
>   ACA       16       23       16        0        0        0
>   ACB       13       19       15        0        0        0
>   BAA       17        0        0       55       16       33
>   BAC       15       19       19        0       34        0
>   BCB       30        0        0       68       38       15
>   CAA       17       11        0        0        0        0
>   CCA       15        0        0        0        0        0
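
As a side note, xtabs() itself sums the left-hand side of the formula within each cell, so the aggregate() step can be skipped when only the table is needed. A small sketch on hand-made data (the data frame long below is hypothetical, not the dat built above):

```r
# Hypothetical long-format data: BCB occurs twice in source_1.
long <- data.frame(
  aa     = c("ABB", "BCB", "BCB", "BCB", "CAA"),
  clones = c(12L, 12L, 18L, 11L, 17L),
  source = c("source_1", "source_1", "source_1", "source_2", "source_2")
)

# Two-step version: sum duplicates first, then cross-tabulate.
two_step <- xtabs(clones ~ aa + source,
                  aggregate(clones ~ aa + source, long, sum))

# One-step version: xtabs() adds up clones within each aa/source cell.
one_step <- xtabs(clones ~ aa + source, long)

one_step
all(one_step == two_step)  # both routes give the same counts
```

The intermediate data.frame from aggregate() is still handy when the long format is wanted for other purposes, such as plotting.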

The order of entries in the table is simply the order inherited during rbind. (This applies when aa is a factor, which was the data.frame() default before R 4.0; with character columns the rows come out already sorted.) One can change the order by reorganizing the table. Here, the rows are sorted.

  ord <- order(rownames(tbl))
  head(tbl[ord,], 10)
>      source
> aa    source_1 source_2 source_3 source_4 source_5 source_6
>   AAA       29        0       46        0        0       14
>   AAB       20        0        0        0        0        0
>   AAC        0       19       19        0        0       31
>   ABA        0       11        0        0       15       18
>   ABB       12       14       13        0        0        0
>   ACA       16       23       16        0        0        0
>   ACB       13       19       15        0        0        0
>   ACC        0       11       16        0       15        0
>   BAA       17        0        0       55       16       33
>   BAB        0       15        0        0        0        0
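
The same steps should carry over to the column names used in the question. The sketch below assumes each list element is a data frame with CDR3.aa and Clones columns, as in the printed example. The list samples and its counts are made up for illustration (loosely following the expected table in the question), so immunarch is not needed to run it.

```r
# Hypothetical stand-in for immdata$data: a named list of data frames.
samples <- list(
  `10_pep_10_1` = data.frame(CDR3.aa = c("CASSYGTAYTGELFF", "CASSREKAFF"),
                             Clones  = c(1623L, 1161L)),
  preAg_10_2    = data.frame(CDR3.aa = c("CASSYGTAYTGELFF", "CASSREKAFF"),
                             Clones  = c(234L, 445L)),
  Dec_2_18_1    = data.frame(CDR3.aa = c("CASSREKAFF", "CASSYSRSPQPQHF"),
                             Clones  = c(50L, 398L))
)

# Tag every row with the name of the sample it came from, then stack.
for (i in seq_along(samples)) samples[[i]]$Sample.ID <- names(samples)[i]
long <- do.call(rbind, samples)

# Fix the column order to the order of the list rather than alphabetical.
long$Sample.ID <- factor(long$Sample.ID, levels = names(samples))

# Sequences in rows, samples in columns; absent combinations become 0.
counts <- xtabs(Clones ~ CDR3.aa + Sample.ID, long)

# Optional: put the most abundant sequences (summed over samples) first.
counts <- counts[order(rowSums(counts), decreasing = TRUE), ]
counts
```

Sequences that are missing from a sample show up as 0 without any explicit matching against a global list of strings.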
