How to efficiently calculate allele frequencies (proportions) by group across multiple columns (loci) using data.table

Question

I have a data.table of allele identities (rows are individuals, columns are loci), grouped by a separate column. I want to calculate allele frequencies (proportions) for each locus efficiently, by group. An example data table:

    DT = data.table(Loc1=rep(c("G","T"),each=5), 
      Loc2=c("C","A"), Loc3=c("C","G","G","G",
      "C","G","G","G","G","G"), 
    Group=c(rep("G1",3),rep("G2",4),rep("G3",3)))
    for(i in 1:3)
        set(DT, sample(10,2), i, NA)
    > DT
        Loc1 Loc2 Loc3 Group
     1:    G   NA    C    G1
     2:    G    A    G    G1
     3:    G    C    G    G1
     4:   NA   NA   NA    G2
     5:    G    C   NA    G2
     6:    T    A    G    G2
     7:    T    C    G    G2
     8:    T    A    G    G3
     9:    T    C    G    G3
    10:   NA    A    G    G3

The problem I have is that when I try to do calculations by group, only the allele IDs present in that group are recognized, so I'm struggling to find code that can tell me, e.g., the proportion of G's for locus 1 in all 3 groups. A simple example, calculating a sum (not a proportion) for the first allele at each locus:

    > fun1<- function(x){sum(na.omit(x==unique(na.omit(x))[1]))}
    > DT[,lapply(.SD,fun1),by=Group,.SDcols=1:3]
       Group Loc1 Loc2 Loc3
    1:    G1    3    1    1
    2:    G2    1    2    2
    3:    G3    2    2    3

For G1 the result is that Loc1 has 3 G's, but for G3 it shows that Loc1 has 2 T's, not the number of G's. I want the number of G's for both in this case. So the key problem is that the allele identities are determined by group, not over the whole column. I tried making a separate table with the allele identities I want to use in the calculations, but can't figure out how to include it in fun1 so that the correct cells are referenced in the lapply above. Allele identities table:

    > fun2<- function(x){sort(na.omit(unique(x)))}
    > allele.id<-data.table(DT[,lapply(.SD,fun2),.SDcols=1:3])
    > allele.id
       Loc1 Loc2 Loc3
    1:    G    A    C
    2:    T    C    G

Answer 1

It's probably wise to transform your data.table into long format first. This makes it easier to use for further calculations (or for making visualisations with ggplot2, for example). With the melt function of data.table (which works the same as the melt function of the reshape2 package) you can transform from wide to long format:

DT2 <- melt(DT, id = "Group", variable.name = "loci")

To remove the NA values during the melt operation, add na.rm = TRUE to the call above (na.rm = FALSE is the default behaviour).

Then you can create count and proportion variables as follows:

DT2 <- DT2[, .N, by = .(Group, loci, value)][, prop := N/sum(N), by = .(Group, loci)]

which gives the following result:

> DT2
    Group loci value N      prop
 1:    G1 Loc1     G 3 1.0000000
 2:    G2 Loc1    NA 1 0.2500000
 3:    G2 Loc1     G 1 0.2500000
 4:    G2 Loc1     T 2 0.5000000
 5:    G3 Loc1     T 2 0.6666667
 6:    G3 Loc1    NA 1 0.3333333
 7:    G1 Loc2    NA 1 0.3333333
 8:    G1 Loc2     A 1 0.3333333
 9:    G1 Loc2     C 1 0.3333333
10:    G2 Loc2    NA 1 0.2500000
11:    G2 Loc2     C 2 0.5000000
12:    G2 Loc2     A 1 0.2500000
13:    G3 Loc2     A 2 0.6666667
14:    G3 Loc2     C 1 0.3333333
15:    G1 Loc3     C 1 0.3333333
16:    G1 Loc3     G 2 0.6666667
17:    G2 Loc3    NA 2 0.5000000
18:    G2 Loc3     G 2 0.5000000
19:    G3 Loc3     G 3 1.0000000

If you want it back in wide format, you can use dcast on multiple variables:

DT3 <- dcast(DT2, Group + loci ~ value, value.var = c("N", "prop"), fill = 0)

which results in:

> DT3
   Group loci N_A N_C N_G N_T N_NA    prop_A    prop_C    prop_G    prop_T   prop_NA
1:    G1 Loc1   0   0   3   0    0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
2:    G1 Loc2   1   1   0   0    1 0.3333333 0.3333333 0.0000000 0.0000000 0.3333333
3:    G1 Loc3   0   1   2   0    0 0.0000000 0.3333333 0.6666667 0.0000000 0.0000000
4:    G2 Loc1   0   0   1   2    1 0.0000000 0.0000000 0.2500000 0.5000000 0.2500000
5:    G2 Loc2   1   2   0   0    1 0.2500000 0.5000000 0.0000000 0.0000000 0.2500000
6:    G2 Loc3   0   0   2   0    2 0.0000000 0.0000000 0.5000000 0.0000000 0.5000000
7:    G3 Loc1   0   0   0   2    1 0.0000000 0.0000000 0.0000000 0.6666667 0.3333333
8:    G3 Loc2   2   1   0   0    0 0.6666667 0.3333333 0.0000000 0.0000000 0.0000000
9:    G3 Loc3   0   0   3   0    0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
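
If missing genotypes should not count toward the denominators, the na.rm = TRUE option mentioned earlier can be combined with the same count-and-cast steps. The following is a minimal sketch under that assumption, not part of the original answer; the names long, freq, wide and g_loc1 and the column name allele are chosen here for illustration. It ends by pulling out the frequency of G at Loc1 for all three groups, with 0 for groups where G is absent.

```r
library(data.table)

# Example data from this answer, rebuilt so the sketch is self-contained.
DT <- data.table(
  Loc1  = c("G", "G", "G", NA, "G", "T", "T", "T", "T", NA),
  Loc2  = c(NA, "A", "C", NA, "C", "A", "C", "A", "C", "A"),
  Loc3  = c("C", "G", "G", NA, NA, "G", "G", "G", "G", "G"),
  Group = c("G1", "G1", "G1", "G2", "G2", "G2", "G2", "G3", "G3", "G3")
)

# Drop missing genotypes while melting, so they never enter the denominators.
long <- melt(DT, id.vars = "Group", variable.name = "loci",
             value.name = "allele", na.rm = TRUE)

# Count each allele per group and locus, then divide by the non-missing total.
freq <- long[, .N, by = .(Group, loci, allele)]
freq[, prop := N / sum(N), by = .(Group, loci)]

# Cast to wide format; fill = 0 gives alleles absent from a group a frequency of 0.
wide <- dcast(freq, Group + loci ~ allele, value.var = "prop", fill = 0)

# Frequency of allele G at Loc1 in every group, including groups without any G.
g_loc1 <- wide[loci == "Loc1", .(Group, G)]
print(g_loc1)
```

Whether missing values belong in the denominator depends on the analysis; the tables above keep them, this sketch drops them.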

Another, more straightforward approach is to use melt and dcast in one call (a simplified version of the first part of @Frank's answer):

DT2 <- dcast(melt(DT, id="Group"), Group + variable ~ value)

which gives:

> DT2
   Group variable A C G T NA
1:    G1     Loc1 0 0 3 0  0
2:    G1     Loc2 1 1 0 0  1
3:    G1     Loc3 0 1 2 0  0
4:    G2     Loc1 0 0 1 2  1
5:    G2     Loc2 1 2 0 0  1
6:    G2     Loc3 0 0 2 0  2
7:    G3     Loc1 0 0 0 2  1
8:    G3     Loc2 2 1 0 0  0
9:    G3     Loc3 0 0 3 0  0

Because the default aggregation function in dcast is length, you automatically get the counts for each of the values.
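
The counts from the one-call approach can also be turned into proportions by dividing each row by its total. This is a rough sketch rather than part of the original answer: it passes fun.aggregate = length explicitly instead of relying on the default, drops NA values while melting (an assumption; keep them if they should be counted), and uses the illustrative names counts, alleles, totals and props.

```r
library(data.table)

# Example data from this answer, rebuilt so the sketch is self-contained.
DT <- data.table(
  Loc1  = c("G", "G", "G", NA, "G", "T", "T", "T", "T", NA),
  Loc2  = c(NA, "A", "C", NA, "C", "A", "C", "A", "C", "A"),
  Loc3  = c("C", "G", "G", NA, NA, "G", "G", "G", "G", "G"),
  Group = c("G1", "G1", "G1", "G2", "G2", "G2", "G2", "G3", "G3", "G3")
)

# Melt and cast in one call, stating the aggregation function explicitly.
counts <- dcast(melt(DT, id.vars = "Group", na.rm = TRUE),
                Group + variable ~ value, fun.aggregate = length)

# Allele columns are everything except the two identifier columns.
alleles <- setdiff(names(counts), c("Group", "variable"))

# Number of non-missing observations per group and locus (one per row).
totals <- counts[, rowSums(.SD), .SDcols = alleles]

# Divide every allele column by the row totals to get proportions.
props <- counts[, c(.(Group = Group, variable = variable),
                    lapply(.SD, `/`, totals)), .SDcols = alleles]
print(props)
```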

Data used:

DT <- structure(list(Loc1 = c("G", "G", "G", NA, "G", "T", "T", "T", "T", NA), 
                     Loc2 = c(NA, "A", "C", NA, "C", "A", "C", "A", "C", "A"), 
                     Loc3 = c("C", "G", "G", NA, NA, "G", "G", "G", "G", "G"), 
                     Group = c("G1", "G1", "G1", "G2", "G2", "G2", "G2", "G3", "G3", "G3")), 
                .Names = c("Loc1", "Loc2", "Loc3", "Group"), row.names = c(NA, -10L), class = c("data.table", "data.frame"))

Answer 2

Here is another option using table. (I am not sure about the format of the expected output. Also, it is not clear whether the NA elements should be included in the calculation of the proportions. If they should not, we can remove the useNA=... argument.)

We loop through the 'Loc' columns, create a table of each column against 'Group', get the proportions using prop.table (specifying the margin), and store the results in a list ('lst').

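nm1 <- paste0('Loc', 1:3)
lst <- vector('list', length(nm1))

for (i in seq_along(nm1)) {
  # Counts of each allele (including NA, if any) per group for this locus;
  # index by column name so the loop does not depend on column order
  temp <- table(DT$Group, DT[[nm1[i]]], useNA = 'ifany')
  # Store the counts and the row-wise (per-group) proportions
  lst[[i]] <- list(temp, prop.table(temp, 1))
}
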

lst[[1]]
#[[1]]
#    
#     G T <NA>
#  G1 3 0    0
#  G2 1 2    1
#  G3 0 2    1

#[[2]]
#    
#             G         T      <NA>
#  G1 1.0000000 0.0000000 0.0000000
#  G2 0.2500000 0.5000000 0.2500000
#  G3 0.0000000 0.6666667 0.3333333
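
If the NA values should be left out of the proportions, a variant of the loop might look like the sketch below. It is an illustration, not the original answer's code: it relies on table's default of dropping NA, names the list elements by locus, and uses the element names counts and prop, which are chosen here for readability.

```r
library(data.table)

# Example data from the first answer, rebuilt so the sketch is self-contained.
DT <- data.table(
  Loc1  = c("G", "G", "G", NA, "G", "T", "T", "T", "T", NA),
  Loc2  = c(NA, "A", "C", NA, "C", "A", "C", "A", "C", "A"),
  Loc3  = c("C", "G", "G", NA, NA, "G", "G", "G", "G", "G"),
  Group = c("G1", "G1", "G1", "G2", "G2", "G2", "G2", "G3", "G3", "G3")
)

nm1 <- paste0("Loc", 1:3)

# One list element per locus, each holding counts and per-group proportions.
lst <- lapply(setNames(nm1, nm1), function(nm) {
  counts <- table(DT$Group, DT[[nm]])  # NA values are dropped by default
  list(counts = counts, prop = prop.table(counts, margin = 1))
})

# Proportion of allele G at Loc1 in every group (0 where G is absent).
lst$Loc1$prop[, "G"]
```

Because table builds its columns from the whole column rather than per group, every group gets an entry for every allele, which is the behaviour the question asks for.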


Answer 1
5 Accepted 2015-10-17 12:07:09

Answer 2
1 2015-10-17 13:36:34
