
Calculate correlation coefficient by bootstrapping

I'm looking at the correlation between the day of the year on which 5 species of bird started moulting their feathers and the number of days it took these 5 species to complete the moult.

I've tried to simulate my data in the code below. For each of the 5 species, I have the start day for 10 individuals and the duration for 10 individuals. For each species, I calculated the mean start day and the mean duration, then calculated the correlation across these 5 species.

What I want to do is bootstrap the mean start day and the mean duration for each species. I want to repeat this 10,000 times and calculate the correlation coefficient after each repeat. I then want to extract the 0.025, 0.5 and 0.975 quantiles of the 10,000 correlation coefficients.

I got as far as simulating the raw data, but my code quickly got messy once I tried to bootstrap. Can anyone help me with this?

# speciesXX_start_day is the day of the year on which each of 10 individual birds started moulting its feathers
# speciesXX_duration is the number of days that each individual bird took to complete the moulting of its feathers
species1_start_day <- as.integer(rnorm(10, 10, 2))
species1_duration <- as.integer(rnorm(10, 100, 2))

species2_start_day <- as.integer(rnorm(10, 20, 2))
species2_duration <- as.integer(rnorm(10, 101, 2))

species3_start_day <- as.integer(rnorm(10, 30, 2))
species3_duration <- as.integer(rnorm(10, 102, 2))

species4_start_day <- as.integer(rnorm(10, 40, 2))
species4_duration <- as.integer(rnorm(10, 103, 2))

species5_start_day <- as.integer(rnorm(10, 50, 2))
species5_duration <- as.integer(rnorm(10, 104, 2))

start_dates <- list(species1_start_day, species2_start_day, species3_start_day, species4_start_day, species5_start_day)
start_duration <- list(species1_duration, species2_duration, species3_duration, species4_duration, species5_duration)

library(plyr)

# mean start date for each of the 5 species
starts_mean <- laply(start_dates, mean)

# mean duration for each of the 5 species
durations_mean <- laply(start_duration, mean)

# correlation between start date and duration
cor(starts_mean, durations_mean)

R lets you resample a dataset with the sample function. To bootstrap, you draw random samples (with replacement) from your original data and then recalculate the statistic for each resample. You can save the intermediate results in a data structure so that you can process them afterwards.
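As a minimal sketch of that resampling step, the snippet below draws one bootstrap resample from a single vector and recomputes its mean. The vector x and the seed are made up for illustration and are not taken from the question.

```r
set.seed(1)  # hypothetical seed, only so the sketch is reproducible

# Hypothetical start days for ten individuals of one species
x <- c(8, 9, 10, 10, 11, 12, 9, 10, 13, 7)

# One bootstrap resample: draw with replacement, same size as the original
resample <- sample(x, size = length(x), replace = TRUE)

# The statistic of interest, recomputed on the resample
boot_mean <- mean(resample)
```

Repeating the last two steps many times gives the bootstrap distribution of the mean.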

A possible solution for your specific problem is given below. For each species we draw 10000 resamples of size 3 (a conventional bootstrap would resample at the original sample size, 10 here), calculate the statistics, and save the results in a list or vector. After the bootstrap we can process all the results:

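nrSamples <- 10000
# Pre-allocate storage for the intermediate means and the correlations
listOfMeanStart <- vector(mode = "list", length = nrSamples)
listOfMeanDuration <- vector(mode = "list", length = nrSamples)
correlations <- vector(mode = "numeric", length = nrSamples)

for (i in seq_len(nrSamples))
{
  # Resample 3 values per species with replacement; each result is a 3 x 5 matrix
  sampleStartDate <- sapply(start_dates, sample, size = 3, replace = TRUE)
  sampleDurations <- sapply(start_duration, sample, size = 3, replace = TRUE)

  # Column means give one bootstrapped mean per species
  listOfMeanStart[[i]] <- apply(sampleStartDate, 2, mean)
  listOfMeanDuration[[i]] <- apply(sampleDurations, 2, mean)
  correlations[i] <- cor(listOfMeanStart[[i]], listOfMeanDuration[[i]])
}

# 2.5%, 50% and 97.5% quantiles of the bootstrapped correlation coefficients
quantile(correlations, c(0.025, 0.5, 0.975))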
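The same idea can also be sketched more compactly with replicate. This variant is an assumption-laden alternative rather than the method above: it resamples each species at its original sample size (10) instead of 3, and the helper names boot_mean and boot_cor and the seed are hypothetical. The data are re-simulated in the same shape as in the question so that the block runs on its own.

```r
set.seed(42)  # hypothetical seed for reproducibility

# Simulated data in the same shape as the question: 5 species, 10 individuals each
start_dates    <- lapply(seq(10, 50, by = 10), function(m) as.integer(rnorm(10, m, 2)))
start_duration <- lapply(100:104, function(m) as.integer(rnorm(10, m, 2)))

# Mean of one bootstrap resample (with replacement, original sample size)
boot_mean <- function(x) mean(sample(x, size = length(x), replace = TRUE))

# One bootstrap replicate: resample every species, then correlate the 5 means
boot_cor <- function() {
  cor(sapply(start_dates, boot_mean), sapply(start_duration, boot_mean))
}

nrSamples <- 10000
correlations <- replicate(nrSamples, boot_cor())

# 2.5%, 50% and 97.5% quantiles of the bootstrapped correlations
quantile(correlations, c(0.025, 0.5, 0.975))
```

Start days and durations are resampled independently here, which matches the question's description of separate measurements; if each start day were paired with a duration from the same bird, the rows would need to be resampled together instead.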
