How to use data.table to efficiently calculate allele frequencies (proportions) by group across multiple columns (loci)
I have a data.table of allele identities (rows are individuals, columns are loci), grouped by a separate column. I want to calculate allele frequencies (proportions) for each locus efficiently, by group. An example data table:
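library(data.table)
DT = data.table(Loc1=rep(c("G","T"),each=5),
                Loc2=c("C","A"), Loc3=c("C","G","G","G",
                "C","G","G","G","G","G"),
                Group=c(rep("G1",3),rep("G2",4),rep("G3",3)))
# set two random cells per locus to NA (sample() is random, so positions differ between runs)
for(i in 1:3)
  set(DT, sample(10,2), i, NA)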
> DT
Loc1 Loc2 Loc3 Group
1: G NA C G1
2: G A G G1
3: G C G G1
4: NA NA NA G2
5: G C NA G2
6: T A G G2
7: T C G G2
8: T A G G3
9: T C G G3
10: NA A G G3
The problem I have is that when I try to do calculations by group, only the allele ids present in the group are recognized, so I'm struggling to find code that can tell me e.g. the proportion of G's for locus 1 in all 3 groups. Simple example, calculating a sum (not proportion) for the first allele at each locus:
> fun1<- function(x){sum(na.omit(x==unique(na.omit(x))[1]))}
> DT[,lapply(.SD,fun1),by=Group,.SDcols=1:3]
Group Loc1 Loc2 Loc3
1: G1 3 1 1
2: G2 1 2 2
3: G3 2 2 3
For G1 the result is that Loc1 has 3 G's, but for G3 it shows Loc1 has 2 T's, not the number of G's. I want the number of G's for both in this case. So the key problem is that the allele identities are determined by group, not over the whole column.
I tried making a separate table with the allele identities I want to use in calculations, but can't figure out how to include it in fun1 so that the correct cells are referenced in lapply above. Allele identities table:
> fun2<- function(x){sort(na.omit(unique(x)))}
> allele.id<-data.table(DT[,lapply(.SD,fun2),.SDcols=1:3])
> allele.id
Loc1 Loc2 Loc3
1: G A C
2: T C G
It's probably wise to transform your data.table into long format first. This will make it easier to use for further calculations (or for making visualisations with ggplot2, for example). With the melt function of data.table (which works the same as the melt function of the reshape2 package) you can transform from wide to long format:
DT2 <- melt(DT, id = "Group", variable.name = "loci")
If you want to remove the NA values during the melt operation, you can add na.rm = TRUE to the above call (na.rm = FALSE is the default behaviour).
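As a minimal sketch of the na.rm = TRUE variant mentioned above: the block below rebuilds the example data from the "Used data" section of this answer and melts it while dropping missing alleles. The object name DT_long is only an illustrative choice.

```r
library(data.table)

# Same values as the "Used data" block of this answer
DT <- data.table(
  Loc1  = c("G", "G", "G", NA, "G", "T", "T", "T", "T", NA),
  Loc2  = c(NA, "A", "C", NA, "C", "A", "C", "A", "C", "A"),
  Loc3  = c("C", "G", "G", NA, NA, "G", "G", "G", "G", "G"),
  Group = c("G1", "G1", "G1", "G2", "G2", "G2", "G2", "G3", "G3", "G3")
)

# Wide to long, dropping rows whose allele is NA
DT_long <- melt(DT, id.vars = "Group", variable.name = "loci", na.rm = TRUE)

nrow(DT_long)         # number of non-missing genotype cells
anyNA(DT_long$value)  # FALSE: no missing alleles remain
```

Whether missing genotypes should be dropped at this point depends on whether they are meant to count towards the denominator of the proportions.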
Then you can create count and proportion variables as follows:
DT2 <- DT2[, .N, by = .(Group, loci, value)][, prop := N/sum(N), by = .(Group, loci)]
which gives the following result:
> DT2
Group loci value N prop
1: G1 Loc1 G 3 1.0000000
2: G2 Loc1 NA 1 0.2500000
3: G2 Loc1 G 1 0.2500000
4: G2 Loc1 T 2 0.5000000
5: G3 Loc1 T 2 0.6666667
6: G3 Loc1 NA 1 0.3333333
7: G1 Loc2 NA 1 0.3333333
8: G1 Loc2 A 1 0.3333333
9: G1 Loc2 C 1 0.3333333
10: G2 Loc2 NA 1 0.2500000
11: G2 Loc2 C 2 0.5000000
12: G2 Loc2 A 1 0.2500000
13: G3 Loc2 A 2 0.6666667
14: G3 Loc2 C 1 0.3333333
15: G1 Loc3 C 1 0.3333333
16: G1 Loc3 G 2 0.6666667
17: G2 Loc3 NA 2 0.5000000
18: G2 Loc3 G 2 0.5000000
19: G3 Loc3 G 3 1.0000000
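In the result above, NA is counted as if it were an allele. If missing genotypes should not contribute to the denominator (an assumption about what is wanted; the question does not say), a possible variation is to drop them in melt before counting. The names long and freq are illustrative.

```r
library(data.table)

# Same values as the "Used data" block of this answer
DT <- data.table(
  Loc1  = c("G", "G", "G", NA, "G", "T", "T", "T", "T", NA),
  Loc2  = c(NA, "A", "C", NA, "C", "A", "C", "A", "C", "A"),
  Loc3  = c("C", "G", "G", NA, NA, "G", "G", "G", "G", "G"),
  Group = c("G1", "G1", "G1", "G2", "G2", "G2", "G2", "G3", "G3", "G3")
)

# Long format without missing alleles
long <- melt(DT, id.vars = "Group", variable.name = "loci", na.rm = TRUE)

# Count each allele per group and locus, then divide by the
# number of non-missing genotypes in that group and locus
freq <- long[, .N, by = .(Group, loci, value)]
freq[, prop := N / sum(N), by = .(Group, loci)]

# Proportion of G at Loc1, for the groups in which G occurs
freq[loci == "Loc1" & value == "G"]
```

Note that a group in which an allele never occurs (G at Loc1 in G3, for instance) has no row in this long table at all; it only shows up as an explicit zero once the table is cast back to wide format with a fill value.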
If you want it back in wide format, you can use dcast on multiple variables:
DT3 <- dcast(DT2, Group + loci ~ value, value.var = c("N", "prop"), fill = 0)
which results in:
> DT3
Group loci N_A N_C N_G N_T N_NA prop_A prop_C prop_G prop_T prop_NA
1: G1 Loc1 0 0 3 0 0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
2: G1 Loc2 1 1 0 0 1 0.3333333 0.3333333 0.0000000 0.0000000 0.3333333
3: G1 Loc3 0 1 2 0 0 0.0000000 0.3333333 0.6666667 0.0000000 0.0000000
4: G2 Loc1 0 0 1 2 1 0.0000000 0.0000000 0.2500000 0.5000000 0.2500000
5: G2 Loc2 1 2 0 0 1 0.2500000 0.5000000 0.0000000 0.0000000 0.2500000
6: G2 Loc3 0 0 2 0 2 0.0000000 0.0000000 0.5000000 0.0000000 0.5000000
7: G3 Loc1 0 0 0 2 1 0.0000000 0.0000000 0.0000000 0.6666667 0.3333333
8: G3 Loc2 2 1 0 0 0 0.6666667 0.3333333 0.0000000 0.0000000 0.0000000
9: G3 Loc3 0 0 3 0 0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
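To connect this back to the question (the proportion of G's at locus 1 in all three groups), the wide table can simply be subset. The sketch below repeats the steps above on the "Used data" values so that it runs on its own; the names long, counts, wide and loc1_G are illustrative, and NA is still treated as a category here, as in the output above.

```r
library(data.table)

# Same values as the "Used data" block of this answer
DT <- data.table(
  Loc1  = c("G", "G", "G", NA, "G", "T", "T", "T", "T", NA),
  Loc2  = c(NA, "A", "C", NA, "C", "A", "C", "A", "C", "A"),
  Loc3  = c("C", "G", "G", NA, NA, "G", "G", "G", "G", "G"),
  Group = c("G1", "G1", "G1", "G2", "G2", "G2", "G2", "G3", "G3", "G3")
)

# Long format, counts and proportions per group and locus
long   <- melt(DT, id.vars = "Group", variable.name = "loci")
counts <- long[, .N, by = .(Group, loci, value)]
counts[, prop := N / sum(N), by = .(Group, loci)]

# Back to wide; alleles absent from a group get 0 via fill
wide <- dcast(counts, Group + loci ~ value,
              value.var = c("N", "prop"), fill = 0)

# Count and proportion of G at Loc1 for every group, including G3
loc1_G <- wide[loci == "Loc1", .(Group, N_G, prop_G)]
print(loc1_G)
```

The three rows of loc1_G correspond to the Loc1 rows of the wide table shown above, so G3 appears with an explicit zero rather than being silently replaced by the count of T.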
Another, more straightforward approach is to use melt and dcast in one call (a simplified version of the first part of @Frank's answer):
DT2 <- dcast(melt(DT, id="Group"), Group + variable ~ value)
which gives:
> DT2
Group variable A C G T NA
1: G1 Loc1 0 0 3 0 0
2: G1 Loc2 1 1 0 0 1
3: G1 Loc3 0 1 2 0 0
4: G2 Loc1 0 0 1 2 1
5: G2 Loc2 1 2 0 0 1
6: G2 Loc3 0 0 2 0 2
7: G3 Loc1 0 0 0 2 1
8: G3 Loc2 2 1 0 0 0
9: G3 Loc3 0 0 3 0 0
Because dcast falls back to length as its aggregation function when none is specified and several rows map to the same cell, you automatically get the counts for each of the values.
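Relying on that fallback makes dcast print a message about the missing aggregation function. A small sketch of the same call with the aggregation spelled out, using the fun.aggregate and value.var arguments of dcast; the name counts_wide is illustrative.

```r
library(data.table)

# Same values as the "Used data" block of this answer
DT <- data.table(
  Loc1  = c("G", "G", "G", NA, "G", "T", "T", "T", "T", NA),
  Loc2  = c(NA, "A", "C", NA, "C", "A", "C", "A", "C", "A"),
  Loc3  = c("C", "G", "G", NA, NA, "G", "G", "G", "G", "G"),
  Group = c("G1", "G1", "G1", "G2", "G2", "G2", "G2", "G3", "G3", "G3")
)

# Count alleles per group and locus, naming the aggregation explicitly
counts_wide <- dcast(melt(DT, id.vars = "Group"),
                     Group + variable ~ value,
                     value.var = "value",
                     fun.aggregate = length)

# Number of G's at Loc1 in each group
counts_wide[variable == "Loc1", .(Group, G)]
```

Being explicit also documents the intent for readers who do not know the default behaviour.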
Used data:
DT <- structure(list(Loc1 = c("G", "G", "G", NA, "G", "T", "T", "T", "T", NA),
Loc2 = c(NA, "A", "C", NA, "C", "A", "C", "A", "C", "A"),
Loc3 = c("C", "G", "G", NA, NA, "G", "G", "G", "G", "G"),
Group = c("G1", "G1", "G1", "G2", "G2", "G2", "G2", "G3", "G3", "G3")),
.Names = c("Loc1", "Loc2", "Loc3", "Group"), row.names = c(NA, -10L), class = c("data.table", "data.frame"))
Here is another option using table. (I am not sure about the format of the expected output. Also, it is not clear whether we need to include the NA elements in the calculation of the proportions. If we don't need them, we can remove the useNA=... argument.)
We loop through the 'Loc' columns, create a table of that column with the 'Group', get the proportions using prop.table (specifying the margin) and store the results in a list ('lst').
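nm1 <- paste0('Loc', 1:3)
lst <- vector('list', length(nm1))
for(i in seq_along(nm1)){
  # counts of each allele per group for this locus (NA kept as its own column)
  temp <- table(DT$Group, DT[[nm1[i]]], useNA = 'ifany')
  # store the counts and the row-wise (per group) proportions
  lst[[i]] <- list(temp, prop.table(temp, 1))
}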
lst[[1]]
#[[1]]
#
# G T <NA>
# G1 3 0 0
# G2 1 2 1
# G3 0 2 1
#[[2]]
#
# G T <NA>
# G1 1.0000000 0.0000000 0.0000000
# G2 0.2500000 0.5000000 0.2500000
# G3 0.0000000 0.6666667 0.3333333
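A possible variation on the loop above, sketched under the assumption that NA should be left out of the proportions: dropping useNA makes table ignore missing values, and lapply over a named vector gives a list that can be indexed by locus name. The name props is illustrative.

```r
library(data.table)

# Same values as the "Used data" block of the other answer
DT <- data.table(
  Loc1  = c("G", "G", "G", NA, "G", "T", "T", "T", "T", NA),
  Loc2  = c(NA, "A", "C", NA, "C", "A", "C", "A", "C", "A"),
  Loc3  = c("C", "G", "G", NA, NA, "G", "G", "G", "G", "G"),
  Group = c("G1", "G1", "G1", "G2", "G2", "G2", "G2", "G3", "G3", "G3")
)

nm1 <- paste0("Loc", 1:3)

# One group-by-allele proportion table per locus; NA is excluded
# because table() drops missing values unless useNA is set
props <- lapply(setNames(nm1, nm1), function(loc) {
  prop.table(table(DT$Group, DT[[loc]]), margin = 1)
})

# Proportion of G at Loc1 in each group
props$Loc1[, "G"]
```

Because table builds its columns from all alleles seen in the whole column, G is present for every group even where its count is zero. A group with only missing values at a locus would give NaN here, which does not occur in this example.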