Create a table (matrix) of counts in R
I'm trying to develop a table from a series of lists that I've converted to data frames. Each list consists of character strings and their counts. Each character string varies in length between 7 and 20 (or more) characters. Each list has a header that identifies the source of the strings. I have 66 lists (sources). Each list contains over 5,000 strings. Not every string is contained in every list, so the number of strings in the lists varies. Here's an example of the structure of a single list.
$PreAg_18_2
CDR3.aa Clones
<chr> <int>
CASSYGTAYTGELFF 1623
CASSRGDSDNSPLHF 1440
CASSREKAFF 1161
CSGMGALAKNIQYF 949
CSAYTGLSYEQYF 813
CASSLSLAVNSPLHF 634
CAIRDTPGSPQHF 574
CATGQVNTEAFF 555
CASSLKGQGGSPLHF 499
CASSYSRSPQPQHF 478
I want to combine the results in a single table showing the counts (Clones), with all the strings (CDR3.aa) listed on the y-axis and each list's header (Sample.Id) on the x-axis. An example would be:
10_pep_10_1 preAg_10_2 Dec_2_18_1 …...
CASSYGTAYTGELFF 1623 234 0
CASSRGDSDNSPLHF 1440 522 28
CASSREKAFF 1161 445 50
CSGMGALAKNIQYF 949 24 0
CASSYSRSPQPQHF 478 0 398
.
.
I'm able to generate a single individual list as in the example, and I'm thinking that converting the lists to data frames is a better way to manipulate them, but I'm having trouble consolidating them against a single list of all the strings and moving the sample.id to the x-axis. I'm thinking I unlist and join all the strings into a single df, but I'm not sure how to keep the counts matched to the strings. Is there a function in R that will help me do this? Or is a loop unavoidable?
So far I've been able to generate a global list of strings, but I now need to match the counts by header (sample.id). I'm not sure how to approach this.
library(immunarch)
library(stringr)
library(plyr)
immdata = repLoad("/mnt/data/Development/Analysis_Script/input_files/")
all <- immdata$data
# Get list headers (names)
sample.id <- names(all)
# make new variable for extraction of clones
all.c <- all
# Get list of clones and filter for unique clones per list.
for (i in 1:length(all.c)){
all.c[[i]]$Sample.ID<-names(all.c)[i]
all.c[[i]]<-all.c[[i]][,c("CDR3.aa", "Clones")]
}
# bysamp is a list (vector) of the samples and their clones
bysamp <- split(all.c, sample.id, sep=" ")
# make vector of all clones
all.clones <- unlist(all.c, use.names=FALSE)
# a list of the aggregate of all the clones in all the samples.
all.clones
# Removes clone repeats
all.clones.u <- unique(all.clones)
# convert list of clones and sample.ids to data frame
all.clones.u <- data.frame(all.clones.u)
sample.id <- data.frame(sample.id)
# Additional code here:
See the summary above for the expected matrix (table).
Here's a solution based on my best guess as to the structure of your data (it sounds familiar, as I'm surrounded by immunologists). The key is to add a variable to each source that will keep track of the source. The sources (list/data.frames) can then be combined into a single data.frame and processed further.
First, set a random number seed for a reproducible example.
set.seed(1234)
Create a simplified artificial data set. This will consist of 6 sources (list/data.frames). Each data.frame has two variables named aa and clones. Each of the 12 aa values in a data.frame is three letters drawn at random from A, B and C, standing in for a CDR3 amino acid sequence. The count of each clone is stored in clones and is set to a random integer between 10 and 20. Finally, each of the 6 lists/data.frames is given a name. Instead of "10_pep_10_1" I use source_1, source_2, etc.
Hopefully this replicates the data you face. By using just 3 possible amino acids, this example ensures that the same sequence has a good chance of occurring a few times in the different lists.
# generate sample data
spl <- replicate(6, {   # the braces '{}' define an expression to be repeated
  n <- 12               # number of aa values in each list
  aa <- replicate(n,
    paste(sample(LETTERS[1:3], 3, replace = TRUE), collapse = ""))
  clones <- sample(10:20, n, replace = TRUE)
  data.frame(aa, clones)   # this is the 'return' value of the expression
}, simplify = FALSE)       # this ensures that the result remains as a list
# name each list
names(spl) <- paste("source", seq_along(spl), sep = "_")
Examine the first of the 6 data.frames.
head(spl$source_1)
> aa clones
> 1 ABB 12
> 2 BCB 12
> 3 AAB 20
> 4 BCB 18
> 5 ACA 16
> 6 CAA 17
Add a new variable named source to each list/data.frame that holds the name of the source. This is easily done with a simple for loop. Show the change in the first data.frame.
for (i in seq_along(spl)) spl[[i]]$source <- names(spl)[i]
head(spl$source_1) # or head(spl[[1]])
> aa clones source
> 1 ABB 12 source_1
> 2 BCB 12 source_1
> 3 AAB 20 source_1
> 4 BCB 18 source_1
> 5 ACA 16 source_1
> 6 CAA 17 source_1
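The same tagging step can also be written without an explicit loop. What follows is only a minimal sketch, not part of the original answer: it assumes a small hand-made list called toy (a hypothetical stand-in for spl, so that the snippet runs on its own) and uses Map() to walk over the data.frames and their names in parallel.

```r
# 'toy' is a hypothetical stand-in for spl, made up for illustration
toy <- list(
  source_1 = data.frame(aa = c("ABB", "BCB"), clones = c(12L, 18L)),
  source_2 = data.frame(aa = c("BCB", "CAA"), clones = c(11L, 17L))
)

# add a 'source' column to each data.frame, holding the name of its list element
tagged <- Map(function(d, nm) {
  d$source <- nm
  d
}, toy, names(toy))

tagged$source_2
```

Map() keeps the names of the list, so tagged can be used wherever the loop's result would be.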
Now, combine each of the list/data.frames into a single data.frame, with the variable source keeping track of which list/data.frame contributed the values. Then use base functions to tally the number (clones) of each peptide (aa) within each source. The result, stored in res, is another data.frame. A contingency table of counts is then generated from it. Often this is combined into a single step. See the help file for aggregate() for more information. A popular approach for this sort of data wrangling is through the dplyr package.
dat <- do.call(rbind, spl)
res <- aggregate(clones ~ aa + source, dat, sum)
tbl <- xtabs(clones ~ aa + source, res)
# this operation is rather common and often is done in one line:
tbl <- xtabs(clones ~ ., aggregate(clones ~ ., dat, sum))
head(tbl, 10)
> source
> aa source_1 source_2 source_3 source_4 source_5 source_6
> AAA 29 0 46 0 0 14
> AAB 20 0 0 0 0 0
> ABB 12 14 13 0 0 0
> ACA 16 23 16 0 0 0
> ACB 13 19 15 0 0 0
> BAA 17 0 0 55 16 33
> BAC 15 19 19 0 34 0
> BCB 30 0 0 68 38 15
> CAA 17 11 0 0 0 0
> CCA 15 0 0 0 0 0
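For readers who prefer the dplyr route mentioned above, here is a rough sketch of how the same tally might look. It is an illustration rather than the answer's own code, and it assumes that the dplyr and tidyr packages are installed; toy is a hypothetical stand-in for spl so that the snippet is self-contained.

```r
library(dplyr)
library(tidyr)

# 'toy' is a hypothetical stand-in for spl, made up for illustration
toy <- list(
  source_1 = data.frame(aa = c("ABB", "BCB", "BCB"), clones = c(12L, 12L, 18L)),
  source_2 = data.frame(aa = c("BCB", "CAA"), clones = c(11L, 17L))
)

wide <- bind_rows(toy, .id = "source") %>%   # stack; list names go into 'source'
  count(aa, source, wt = clones) %>%         # sum clones per aa and source (column 'n')
  pivot_wider(names_from = source,           # one column per source
              values_from = n,
              values_fill = list(n = 0L))    # aa/source pairs that never occur become 0

wide
```

Here bind_rows() with .id replaces both the for loop and do.call(rbind, ...), and the result is a tibble with one row per aa rather than an xtabs table.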
The order of entries in the table is simply the order inherited during rbind. One can change that by reorganizing the table. Here, the rows are sorted.
ord <- order(rownames(tbl))
head(tbl[ord,], 10)
> source
> aa source_1 source_2 source_3 source_4 source_5 source_6
> AAA 29 0 46 0 0 14
> AAB 20 0 0 0 0 0
> AAC 0 19 19 0 0 31
> ABA 0 11 0 0 15 18
> ABB 12 14 13 0 0 0
> ACA 16 23 16 0 0 0
> ACB 13 19 15 0 0 0
> ACC 0 11 16 0 15 0
> BAA 17 0 0 55 16 33
> BAB 0 15 0 0 0 0
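To connect this back to the question, the same steps might be applied to a list whose data.frames use the column names CDR3.aa and Clones. This is a sketch under assumptions: samples is a hypothetical two-element stand-in for immdata$data, with a few values borrowed from the question for illustration only. Since xtabs() itself sums the left-hand side within each cell, the aggregate() step is skipped here.

```r
# 'samples' is a hypothetical stand-in for immdata$data (a named list of data.frames)
samples <- list(
  PreAg_18_2 = data.frame(CDR3.aa = c("CASSYGTAYTGELFF", "CASSREKAFF"),
                          Clones  = c(1623L, 1161L)),
  Dec_2_18_1 = data.frame(CDR3.aa = c("CASSREKAFF", "CASSYSRSPQPQHF"),
                          Clones  = c(50L, 398L))
)

# stack the samples, recording which sample each row came from
long <- do.call(rbind, Map(function(d, id) {
  data.frame(CDR3.aa = d$CDR3.aa, Clones = d$Clones, Sample.ID = id)
}, samples, names(samples)))

# xtabs() sums Clones within each CDR3.aa x Sample.ID cell; absent pairs become 0
tbl <- xtabs(Clones ~ CDR3.aa + Sample.ID, long)

# order rows by total clone count across samples, largest first
tbl <- tbl[order(rowSums(tbl), decreasing = TRUE), , drop = FALSE]

# plain data.frame with one column per sample, if that is easier to export
wide <- as.data.frame.matrix(tbl)
wide
```

Sorting by rowSums() instead of by row name puts the most abundant clones at the top, which is closer to the layout sketched in the question.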