How to create a loop to repeat random sampling procedure in R

I have written some code in R to sample without replacement from 3 separate vectors (list1, list2, list3). I sample 10 times from list1, 20 times from list2 and 30 times from list3. I then combine the 3 sets of random samples and count how many strings were sampled 2 or 3 times. How would I go about automating this so that I can repeat the sampling 100 times and get a distribution of frequency counts? For example, I want to see how frequently I randomly sample the same string from all three lists. Thank you for your assistance.

All input data are lists of thousands of strings like this:

list1:

     V1         
[1,] "EDA"
[2,] "MGN2"  
[3,] "5RSK"      
[4,] "NBLN"

My current code:

sample_list1 <-(sample(list1,10, replace=FALSE))
sample_list2 <-(sample(list2,20, replace=FALSE))
sample_list3 <-(sample(list3,20, replace=FALSE))

combined_randomgenes <- c(list1, list2, list3)
combined_counts <- as.data.frame(table(combined_randomgenes))

overlap_3_lists <- nrow(subset(combined_counts, Freq == 3))
overlap_2_lists <- nrow(subset(combined_counts, Freq == 2))

If, across my 3 random samples, only 1 string occurred in all 3, then I would expect overlap_3_lists to contain the value 1. I would like to automate this so that I get a distribution of values and can plot a histogram showing how many times 0, 1, 2, 3, etc. identical strings are sampled in all 3 lists.

You could also try using mapply(), which is slightly more readable, like this:

my_list <- list( A= 1:8, B= 1:8, C= 1:8)

my_list_sampled <- mapply(sample, size = c(5,5,3), my_list )
names(my_list_sampled) <- names(my_list)


result<- table(stack(my_list_sampled))

hist(result)

This nicely summarizes the data, and you can subset based on the number of observations.

result_all_3 <- rowSums(result) == 3  # TRUE for values drawn from all three vectors

Or count the overlap like this:

result <- data.frame(ifelse(result> 0, 1, 0))

result$overlap <- rowSums(result)

hist(result$overlap)
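
The mapply() snippets above perform a single round of sampling. To get the distribution the question asks for, one option is to wrap a round in a function and repeat it with replicate(). The following is only a minimal sketch under stated assumptions: count_all_three is a hypothetical helper name, the toy vectors and sample sizes are reused from the example above, and the seed is arbitrary.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Toy stand-ins for the three input vectors (assumed data)
my_list <- list(A = 1:8, B = 1:8, C = 1:8)
sizes <- c(5, 5, 3)

# Hypothetical helper: one round of sampling without replacement,
# returning how many values were drawn from every vector
count_all_three <- function(lists, sizes) {
  sampled <- mapply(sample, lists, sizes, SIMPLIFY = FALSE)
  counts <- table(unlist(sampled))
  sum(counts == length(lists))
}

# Repeat the round 100 times and tabulate the overlap counts
overlaps <- replicate(100, count_all_three(my_list, sizes))
table(overlaps)
```

Because each vector is sampled without replacement, a value can appear at most once per vector, so a combined count of 3 means it was drawn from all three.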

You'll want to change 20 to 30 in your third sample. Also, your combined_randomgenes needs to reference the sample_listx vectors rather than the original lists. Then just put a for loop around the code and assign the results. Bonus tips: be wary of using subset() in a script, and set the seed so that your work is reproducible.

set.seed(1234)

list1 <- 1:60
list2 <- 1:60
list3 <- 1:60

n <- 100
runs <- data.frame(run=1:n,threes=NA,twos=NA)
for(i in 1:n) {
  sample_list1 <-(sample(list1,10, replace=FALSE))
  sample_list2 <-(sample(list2,20, replace=FALSE))
  sample_list3 <-(sample(list3,30, replace=FALSE))

  combined_randomgenes <- c(sample_list1, sample_list2, sample_list3)
  combined_counts <- as.data.frame(table(combined_randomgenes))

  runs$threes[i] <- sum(combined_counts$Freq==3)
  runs$twos[i] <- sum(combined_counts$Freq==2)
}

runs
hist(runs$threes,5)
hist(runs$twos,5)
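
Since the overlap counts are small integers, a table of exact counts may be easier to read than a binned histogram. The sketch below is one possible variant of the loop above, not the only way to do it: one_run is a hypothetical helper, the 1:60 vectors are the same toy data, and vapply() stands in for the for loop.

```r
set.seed(1234)

# Toy stand-ins for the real vectors of strings (assumed data)
list1 <- 1:60
list2 <- 1:60
list3 <- 1:60

# Hypothetical helper: one run, returning how many values were
# drawn exactly twice and exactly three times
one_run <- function() {
  combined <- c(sample(list1, 10), sample(list2, 20), sample(list3, 30))
  freq <- table(combined)
  c(twos = sum(freq == 2), threes = sum(freq == 3))
}

# vapply() returns a 2 x 100 matrix; t() gives one row per run
runs <- t(vapply(1:100, function(i) one_run(), c(twos = 0L, threes = 0L)))

# Exact frequency of each overlap count, including counts never observed
# (at most 10 values can appear in all three samples)
threes_dist <- table(factor(runs[, "threes"], levels = 0:10))
threes_dist
barplot(threes_dist, xlab = "Values sampled from all three vectors",
        ylab = "Number of runs")
```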
