How to make a loop to calculate a function based on strings in a column?
I have a data.frame that looks like:
SNP CLST A1 A2 FRQ IMP POS CHR BVAL
1 rs2803291 Brahui C T 0.660000 0 1882185 1 878
2 rs2803291 Balochi C T 0.750000 0 1882185 1 878
3 rs2803291 Hazara C T 0.772727 0 1882185 1 878
4 rs2803291 Makrani C T 0.620000 0 1882185 1 878
5 rs2803291 Sindhi C T 0.770833 0 1882185 1 878
6 rs2803291 Pathan C T 0.681818 0 1882185 1 878
53 rs12060022 Brahui T C 0.0600000 1 3108186 1 982
54 rs12060022 Balochi T C 0.0416667 1 3108186 1 982
55 rs12060022 Hazara T C 0.0000000 1 3108186 1 982
56 rs12060022 Makrani T C 0.0200000 1 3108186 1 982
57 rs12060022 Sindhi T C 0.0625000 1 3108186 1 982
58 rs12060022 Pathan T C 0.0681818 1 3108186 1 982
105 rs870171 Brahui T G 0.2200000 0 3332664 1 976
106 rs870171 Balochi T G 0.3333330 0 3332664 1 976
107 rs870171 Hazara T G 0.3636360 0 3332664 1 976
108 rs870171 Makrani T G 0.1800000 0 3332664 1 976
109 rs870171 Sindhi T G 0.2083330 0 3332664 1 976
110 rs870171 Pathan T G 0.1590910 0 3332664 1 976
157 rs4282783 Brahui G T 0.8400000 1 4090545 1 992
158 rs4282783 Balochi G T 0.9583333 1 4090545 1 992
159 rs4282783 Hazara G T 0.8409090 1 4090545 1 992
160 rs4282783 Makrani G T 0.9000000 1 4090545 1 992
161 rs4282783 Sindhi G T 0.8958330 1 4090545 1 992
162 rs4282783 Pathan G T 0.9772727 1 4090545 1 992
Each SNP locus has several populations associated with it and a frequency (FRQ) for each population. There are L unique SNPs in the whole data.frame. I would like to randomly sample 3 SNPs from the data.frame and then take the sum (FRQ_Balochi_SNP1 - FRQ_Pathan_SNP1) * (FRQ_Y_SNP1 - FRQ_Pathan_SNP1) + (FRQ_Balochi_SNP2 - FRQ_Pathan_SNP2) * (FRQ_Y_SNP2 - FRQ_Pathan_SNP2) + (FRQ_Balochi_SNP3 - FRQ_Pathan_SNP3) * (FRQ_Y_SNP3 - FRQ_Pathan_SNP3) over the 3 randomly sampled SNPs. The notation looks something like
Value = Sum(i = 1 to 3) of (FRQ_Bal_i - FRQ_Pat_i) * (FRQ_Y_i - FRQ_Pat_i)
Y is a given population, for example "Hazara".
I would like my output to be a list of Values from this calculation along with their Y populations.
For example, let's walk through Hazara as our Y population. We randomly sample and get SNP1, SNP2, and SNP4. The first SNP (rs2803291) gives us
(0.75 - 0.681818) * (0.772727 - 0.681818)
for a value of 0.006198. The second SNP (rs12060022) gives us
(0.0416667 - 0.0681818) * (0.0000 - 0.0681818)
for a value of 0.001808. The fourth SNP (rs4282783) gives us
(0.9583333 - 0.9772727) * (0.8409090 - 0.9772727)
for a value of 0.002583. Summing our values together we would get
0.006198 + 0.001808 + 0.002583
for a total of 0.01059. Thus the first line of the output file would be
Population Value
Hazara 0.01059
Makrani ???
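The hand calculation above can be checked directly in R. This is a minimal sketch that copies only the Balochi, Hazara and Pathan frequencies for SNP1, SNP2 and SNP4 from the table in the question; the helper name snp_term is hypothetical and used only for illustration.

```r
# Frequencies copied from the question's table (three populations, three SNPs).
frq <- data.frame(
  SNP  = rep(c("rs2803291", "rs12060022", "rs4282783"), each = 3),
  CLST = rep(c("Balochi", "Hazara", "Pathan"), times = 3),
  FRQ  = c(0.7500000, 0.7727270, 0.6818180,
           0.0416667, 0.0000000, 0.0681818,
           0.9583333, 0.8409090, 0.9772727)
)

# Hypothetical helper: one term of the sum for a single SNP and population y.
snp_term <- function(d, y) {
  bal <- d$FRQ[d$CLST == "Balochi"]
  pat <- d$FRQ[d$CLST == "Pathan"]
  (bal - pat) * (d$FRQ[d$CLST == y] - pat)
}

# One term per SNP (split() orders the SNPs alphabetically), then the total.
terms <- sapply(split(frq, frq$SNP), snp_term, y = "Hazara")
value <- sum(terms)

print(round(terms, 6))
print(round(value, 5))
```

Changing y to another population label that is present in frq gives the corresponding Value for the same three SNPs.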
I would like this done for every population, including Balochi and Pathan if possible.
I would create a helper function, then place it into a looping mechanism that tries out each population label:
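library(dplyr)

# Per-SNP term for population `pop`: (Balochi - Pathan) * (pop - Pathan).
snp_sum <- function(pop, FRQ, CLST) {
  (FRQ[CLST == "Balochi"] - FRQ[CLST == "Pathan"]) *
    (FRQ[CLST == pop] - FRQ[CLST == "Pathan"])
}

# For each population, compute the per-SNP terms, then sum 3 randomly chosen ones.
# Note: sample() runs inside lapply(), so each population gets its own draw of SNPs.
sum_df <- function(mydf, clst_list) {
  lst <- lapply(clst_list, function(x) {
    mydf %>%
      group_by(SNP) %>%
      summarise(FRQ_SUM = snp_sum(x, FRQ, CLST)) %>%
      summarise(Value = sum(FRQ_SUM[sample(n(), 3)]))
  })
  cbind.data.frame(Population = clst_list, do.call("rbind", lst))
}

# df1 is the data.frame from the question.
sum_df(df1, unique(df1$CLST))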
# Population Value
# 1 Brahui 0.0134297098
# 2 Balochi 0.0353677606
# 3 Hazara 0.0400308238
# 4 Makrani 0.0008918497
# 5 Sindhi 0.0161916643
# 6 Pathan 0.0000000000
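In the code above, sample() is called once per population, so each row of the result is based on a different random set of SNPs. If the intent is to draw the 3 SNPs once and reuse them for every population, one possible base R sketch is shown below. The function name sum_shared_sample and its arguments are hypothetical, and it assumes the data.frame has columns SNP, CLST and FRQ with one row per SNP and population pair.

```r
# Hypothetical variant: draw the SNPs once, then score every population on them.
sum_shared_sample <- function(df, n_snps = 3, ref1 = "Balochi", ref2 = "Pathan") {
  picked <- sample(unique(df$SNP), n_snps)      # a single draw, shared by all populations
  sub <- df[df$SNP %in% picked, ]
  # Frequency matrix: rows are SNPs, columns are populations.
  wide <- tapply(sub$FRQ,
                 list(as.character(sub$SNP), as.character(sub$CLST)),
                 mean)
  lead <- wide[, ref1] - wide[, ref2]           # (Balochi - Pathan) per SNP
  # Subtract the Pathan column from every column, weight by `lead`, sum over SNPs.
  values <- colSums(lead * (wide - wide[, ref2]))
  data.frame(Population = names(values), Value = unname(values), row.names = NULL)
}

# Small demo using three SNPs and three populations copied from the question.
frq <- data.frame(
  SNP  = rep(c("rs2803291", "rs12060022", "rs4282783"), each = 3),
  CLST = rep(c("Balochi", "Hazara", "Pathan"), times = 3),
  FRQ  = c(0.7500000, 0.7727270, 0.6818180,
           0.0416667, 0.0000000, 0.0681818,
           0.9583333, 0.8409090, 0.9772727)
)
res <- sum_shared_sample(frq, n_snps = 3)
print(res)
```

Because the demo data contains exactly three SNPs, all of them are picked and the result does not depend on the random draw; on the full data.frame, call set.seed() first if the sample needs to be repeatable.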
Edit
A possible speed-up uses parallel, a package that ships with R:
library(parallel)

no_cores <- detectCores() - 1L            # leave one core free
cl <- makeCluster(no_cores)
clusterExport(cl, c("df1", "snp_sum"))    # copy the data and the helper to each worker
clusterEvalQ(cl, library(dplyr))          # load dplyr on each worker

# Same computation as sum_df(), with one population per task.
sum_parallel <- parLapply(cl, unique(df1$CLST), function(x) {
  df1 %>%
    group_by(SNP) %>%
    summarise(FRQ_SUM = snp_sum(x, FRQ, CLST)) %>%
    summarise(Value = sum(FRQ_SUM[sample(n(), 3)]))
})
cbind.data.frame(Population = unique(df1$CLST), do.call("rbind", sum_parallel))
stopCluster(cl)                           # release the workers
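Since the sampling now happens on the workers, calling set.seed() in the main session does not make the draws repeatable. The parallel package provides clusterSetRNGStream() for seeding the workers. The sketch below is a toy illustration only: the two-worker cluster, the seed value and the sample(10, 3) task are arbitrary choices, not part of the answer.

```r
library(parallel)

cl <- makeCluster(2L)                     # two workers, chosen only for this demo

clusterSetRNGStream(cl, iseed = 123)      # give each worker its own seeded stream
first <- parLapply(cl, 1:4, function(i) sample(10, 3))

clusterSetRNGStream(cl, iseed = 123)      # reseed the same way before repeating
second <- parLapply(cl, 1:4, function(i) sample(10, 3))

stopCluster(cl)

# Compare the two runs.
print(identical(first, second))
```

In the parallel code above, the equivalent step would be a clusterSetRNGStream(cl, iseed = ...) call placed before parLapply().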