How to make a loop to calculate a function based on strings in a column?

I have a data.frame that looks like this:

              SNP              CLST A1 A2       FRQ IMP     POS CHR BVAL
    1   rs2803291            Brahui  C  T  0.660000   0 1882185   1  878
    2   rs2803291           Balochi  C  T  0.750000   0 1882185   1  878
    3   rs2803291            Hazara  C  T  0.772727   0 1882185   1  878
    4   rs2803291           Makrani  C  T  0.620000   0 1882185   1  878
    5   rs2803291            Sindhi  C  T  0.770833   0 1882185   1  878
    6   rs2803291            Pathan  C  T  0.681818   0 1882185   1  878
    53  rs12060022           Brahui  T  C 0.0600000   1 3108186   1  982
    54  rs12060022          Balochi  T  C 0.0416667   1 3108186   1  982
    55  rs12060022           Hazara  T  C 0.0000000   1 3108186   1  982
    56  rs12060022          Makrani  T  C 0.0200000   1 3108186   1  982
    57  rs12060022           Sindhi  T  C 0.0625000   1 3108186   1  982
    58  rs12060022           Pathan  T  C 0.0681818   1 3108186   1  982
    105   rs870171           Brahui  T  G 0.2200000   0 3332664   1  976
    106   rs870171          Balochi  T  G 0.3333330   0 3332664   1  976
    107   rs870171           Hazara  T  G 0.3636360   0 3332664   1  976
    108   rs870171          Makrani  T  G 0.1800000   0 3332664   1  976
    109   rs870171           Sindhi  T  G 0.2083330   0 3332664   1  976
    110   rs870171           Pathan  T  G 0.1590910   0 3332664   1  976
    157  rs4282783           Brahui  G  T 0.8400000   1 4090545   1  992
    158  rs4282783          Balochi  G  T 0.9583333   1 4090545   1  992
    159  rs4282783           Hazara  G  T 0.8409090   1 4090545   1  992
    160  rs4282783          Makrani  G  T 0.9000000   1 4090545   1  992
    161  rs4282783           Sindhi  G  T 0.8958330   1 4090545   1  992
    162  rs4282783           Pathan  G  T 0.9772727   1 4090545   1  992

Each SNP locus is associated with several populations, and each population has its own frequency (FRQ) at that locus. The full data.frame contains L unique SNPs. I would like to randomly sample 3 SNPs from the data.frame and then, for a given population Y (for example, "Hazara"), take the sum (FRQ_Balochi_SNP1 - FRQ_Pathan_SNP1) * (FRQ_Y_SNP1 - FRQ_Pathan_SNP1) + (FRQ_Balochi_SNP2 - FRQ_Pathan_SNP2) * (FRQ_Y_SNP2 - FRQ_Pathan_SNP2) + (FRQ_Balochi_SNP3 - FRQ_Pathan_SNP3) * (FRQ_Y_SNP3 - FRQ_Pathan_SNP3) over the 3 sampled SNPs. In compact notation: Value = Sum(i = 1 to 3) of (FRQ_Bal_i - FRQ_Pat_i) * (FRQ_Y_i - FRQ_Pat_i).

I would like my output to be a list of the Values from this calculation, each paired with its Y population.

For example, let's walk through Hazara as our Y population. Suppose the random sample gives SNP1, SNP2, and SNP4. The first SNP (rs2803291) gives (0.75 - 0.681818) * (0.772727 - 0.681818) = 0.006198. The second SNP (rs12060022) gives (0.0416667 - 0.0681818) * (0.0000 - 0.0681818) = 0.001808. The fourth SNP (rs4282783) gives (0.9583333 - 0.9772727) * (0.8409090 - 0.9772727) = 0.002583. Summing these values gives 0.006198 + 0.001808 + 0.002583 = 0.01059. Thus the first line of the output file would be

Population   Value
Hazara       0.01059
Makrani      ???

I would like this done for every population, including Balochi and Pathan if possible.
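As a quick check of the walk-through, here is a minimal base R sketch that recomputes the three per-SNP terms for Y = Hazara. The frequencies are copied from the table; the names `frq`, `terms`, and `value` are made up for this illustration.

```r
# Frequencies for the three sampled SNPs, copied from the table in the question
frq <- data.frame(
  SNP     = c("rs2803291", "rs12060022", "rs4282783"),
  Balochi = c(0.750000, 0.0416667, 0.9583333),
  Hazara  = c(0.772727, 0.0000000, 0.8409090),
  Pathan  = c(0.681818, 0.0681818, 0.9772727)
)

# Per-SNP term: (FRQ_Bal_i - FRQ_Pat_i) * (FRQ_Y_i - FRQ_Pat_i), with Y = Hazara
terms <- (frq$Balochi - frq$Pathan) * (frq$Hazara - frq$Pathan)

# Value is the sum of the three terms
value <- sum(terms)

print(round(terms, 6))
print(round(value, 5))
```

Running it prints the three per-SNP terms and their sum, which can be compared against the hand calculation.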

I would create a helper function and then call it in a loop over each population label:

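library(dplyr)

# Per-SNP term for one target population `pop` (the "Y" in the question):
# (FRQ_Balochi - FRQ_Pathan) * (FRQ_pop - FRQ_Pathan)
snp_sum <- function(pop, FRQ, CLST) {
  (FRQ[CLST == "Balochi"] - FRQ[CLST == "Pathan"]) * (FRQ[CLST == pop] - FRQ[CLST == "Pathan"])
}

# For each population in clst_list: compute the term per SNP,
# then sum the terms of 3 randomly chosen SNPs.
# Note: sample() runs once per population, so each row of the
# result may be based on a different set of 3 SNPs.
sum_df <- function(mydf, clst_list) {
  lst <- lapply(clst_list, function(x) {
           mydf %>% group_by(SNP) %>%
           summarise(FRQ_SUM = snp_sum(x, FRQ, CLST)) %>%
           summarise(Value = sum(FRQ_SUM[sample(n(), 3)]))
         })
  cbind.data.frame(Population = clst_list, do.call("rbind", lst))
}

# df1 is the data.frame shown in the question
sum_df(df1, unique(df1$CLST))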
#   Population        Value
# 1     Brahui 0.0134297098
# 2    Balochi 0.0353677606
# 3     Hazara 0.0400308238
# 4    Makrani 0.0008918497
# 5     Sindhi 0.0161916643
# 6     Pathan 0.0000000000
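
If every population should be scored on the same 3 SNPs, one option is to draw the sample once and reuse it. The following is a minimal base R sketch of that variant, not the code above: it rebuilds a reduced `df1` (only the SNP, CLST, and FRQ columns from the question), and the names `frq`, `picked`, `frq_pick`, `d_bal`, and the seed 42 are arbitrary choices for illustration.

```r
# Reduced copy of the question's data: 4 SNPs x 6 populations
pops <- c("Brahui", "Balochi", "Hazara", "Makrani", "Sindhi", "Pathan")
df1 <- data.frame(
  SNP  = rep(c("rs2803291", "rs12060022", "rs870171", "rs4282783"), each = 6),
  CLST = rep(pops, times = 4),
  FRQ  = c(0.660000, 0.750000, 0.772727, 0.620000, 0.770833, 0.681818,
           0.0600000, 0.0416667, 0.0000000, 0.0200000, 0.0625000, 0.0681818,
           0.2200000, 0.3333330, 0.3636360, 0.1800000, 0.2083330, 0.1590910,
           0.8400000, 0.9583333, 0.8409090, 0.9000000, 0.8958330, 0.9772727),
  stringsAsFactors = FALSE
)

# SNP x population matrix; each cell holds one value, so mean() just returns it
frq <- tapply(df1$FRQ, list(df1$SNP, df1$CLST), mean)

set.seed(42)                        # arbitrary seed, only for reproducibility
picked <- sample(rownames(frq), 3)  # draw the 3 SNPs once
frq_pick <- frq[picked, , drop = FALSE]

# (FRQ_Bal_i - FRQ_Pat_i) for the sampled SNPs
d_bal <- frq_pick[, "Balochi"] - frq_pick[, "Pathan"]

# For every population Y at once: sum_i d_bal_i * (FRQ_Y_i - FRQ_Pat_i)
values <- colSums(d_bal * (frq_pick - frq_pick[, "Pathan"]))

result <- data.frame(Population = names(values), Value = unname(values))
print(result)
```

With this layout the Pathan row is zero by construction, and the Balochi row is the sum of squared Balochi-Pathan differences over the sampled SNPs.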

Edit

A possible speed-up uses parallel, a package that ships with R:

library(parallel)

# Leave one core free, but always use at least one worker
no_cores <- max(1L, detectCores() - 1L, na.rm = TRUE)
cl <- makeCluster(no_cores)

# Each worker needs the data, the helper function, and dplyr
clusterExport(cl, c("df1", "snp_sum"))
clusterEvalQ(cl, library(dplyr))

sum_parallel <- parLapply(cl, unique(df1$CLST), function(x) {
  df1 %>% group_by(SNP) %>%
    summarise(FRQ_SUM = snp_sum(x, FRQ, CLST)) %>%
    summarise(Value = sum(FRQ_SUM[sample(n(), 3)]))
})

cbind.data.frame(Population = unique(df1$CLST), do.call("rbind", sum_parallel))

# Shut the workers down when finished
stopCluster(cl)
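
Because sample() runs on the workers, the sampled SNPs will differ between runs unless the workers' random-number streams are seeded. A minimal sketch of one way to do that, assuming reproducibility matters here: clusterSetRNGStream() from parallel seeds every worker. The worker count of 2, the seed 123, and the toy task standing in for the SNP sampling are arbitrary choices for illustration.

```r
library(parallel)

cl <- makeCluster(2L)  # two workers, an arbitrary choice for this sketch

# Seed the workers, then run a toy task that samples on each worker
clusterSetRNGStream(cl, iseed = 123)
draws1 <- parLapply(cl, 1:4, function(i) sample(10, 3))

# Reset to the same seed and repeat the same task
clusterSetRNGStream(cl, iseed = 123)
draws2 <- parLapply(cl, 1:4, function(i) sample(10, 3))

stopCluster(cl)

# Compare the two runs
print(identical(draws1, draws2))
```

With the same seed, the same number of workers, and parLapply's static split of the tasks, the two runs are expected to match.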
