
Random subsample with uniform distribution in R

I have a large dataset containing observations of Vegetation Indices (VI). I'm using R to randomly subsample the data so that the subsample's distribution (relative frequency) is uniform, that is, roughly equal numbers of observations over the entire VI range. So far I haven't been able to get a reasonably even distribution.

Example:

norm<-rnorm(1000, mean = .5, sd = .25) # I have this 

hist(norm) #that is distributed like this

hist(unif<-runif(1000, min=0, max=1)) # but I want to resample the data to look like this

How about this: divide the range of VI into bins of equal width and put the data into these bins. There will be more data in the bins in the middle of the distribution than toward the ends. Choose a bin at random (with equal probability) and then choose one item from that bin.
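The bin-based idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the vector vi is simulated stand-in data, and the bin count n_bins and subsample size n_out are arbitrary values chosen for the example, not anything given in the question.

```r
# Sketch: equal-width bins, pick a bin uniformly, then pick one item from it.
set.seed(1)
vi <- rnorm(1e4, mean = 0.5, sd = 0.25)   # stand-in for the real VI data (assumption)
vi <- vi[vi >= 0 & vi <= 1]               # keep only the range of interest

n_bins <- 20                              # assumed number of bins
n_out  <- 1000                            # assumed subsample size
breaks <- seq(0, 1, length.out = n_bins + 1)

# Assign each observation to a bin and collect the indices per bin.
bin_id <- cut(vi, breaks = breaks, include.lowest = TRUE, labels = FALSE)
by_bin <- split(seq_along(vi), factor(bin_id, levels = seq_len(n_bins)))
by_bin <- by_bin[lengths(by_bin) > 0]     # drop empty bins, if any

# Choose a bin with equal probability, then one index within that bin.
chosen_bins <- sample(length(by_bin), n_out, replace = TRUE)
idx <- vapply(chosen_bins, function(b) {
  members <- by_bin[[b]]
  members[sample.int(length(members), 1L)]
}, integer(1))

vi_sub <- vi[idx]
hist(vi_sub, breaks = breaks)
```

Because each draw is independent, the same observation can be picked more than once, most likely in the sparsely populated tail bins.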

One variation on that idea is to choose a point x in the range of VI at random (with equal probability) and then find the data that fall within the interval from (x - dx/2) to (x + dx/2), where dx is big enough to catch at least a few data points. Then choose one datum from that interval (with equal probability). There are probably many more variations.
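A minimal sketch of this sliding-window variation, again on simulated stand-in data. The window width dx, the helper name pick_one, and the subsample size n_out are assumptions made for illustration only.

```r
# Sketch: pick a random centre x, then one datum within [x - dx/2, x + dx/2].
set.seed(1)
vi <- rnorm(1e4, mean = 0.5, sd = 0.25)   # stand-in for the real VI data (assumption)
vi <- vi[vi >= 0 & vi <= 1]

dx    <- 0.05                             # assumed window width
n_out <- 1000                             # assumed number of draws

# Hypothetical helper: index of one randomly chosen datum inside the window.
pick_one <- function(x) {
  candidates <- which(abs(vi - x) <= dx / 2)
  if (length(candidates) == 0L) return(NA_integer_)   # empty window
  candidates[sample.int(length(candidates), 1L)]
}

centers <- runif(n_out, min = 0, max = 1)
idx     <- vapply(centers, pick_one, integer(1))
vi_sub  <- vi[idx[!is.na(idx)]]           # skip draws whose window was empty
hist(vi_sub)
```

Windows centred near 0 or 1 extend past the data range, so the result may be slightly less even at the edges than in the middle.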

One consequence of non-uniform sampling like that is that you might select the same items from the tails over and over. I don't see a way around that; it seems to be an inevitable consequence. But I could be wrong about that.

Sample with inverted input distribution weighting

Aha! I've thought of a second solution, which I think is probably better than my first. I've kept the first under the section "Repeated target distribution nearest-match selection" below.

The sample() function has a prob parameter which allows us to specify probability weights for the elements of the input vector. We can use this parameter to increase the probability of selecting elements that occur in sparser segments of the input distribution (that is, the tails) and decrease the probability of selecting elements that occur in denser segments (that is, the center). I think a simple arithmetic inversion of the density function dnorm() will be sufficient:

Test data

set.seed(1L);
normSize <- 1e4L; normMean <- 0.5; normSD <- 0.25;
norm <- rnorm(normSize,normMean,normSD);

Solution

unifSize <- 1e3L; unifMin <- 0; unifMax <- 1;
normForUnif <- norm[norm>=unifMin & norm<=unifMax];
d <- dnorm(normForUnif,normMean,normSD);
unif <- sample(normForUnif,unifSize,prob=1/d);
hist(unif);

[Image: histogram of unif]
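The solution above relies on knowing that the input is normal with a given mean and SD. For real VI data whose distribution is not known, one possible adaptation (an assumption on my part, not something from the answer) is to replace dnorm() with a kernel density estimate from density(), interpolated at each observation with approx(). The names x, d_hat and x_sub below are illustrative.

```r
# Sketch: inverse-density weights from an estimated density instead of dnorm().
set.seed(1)
x <- rnorm(1e4, mean = 0.5, sd = 0.25)       # stand-in for the real data (assumption)
x <- x[x >= 0 & x <= 1]

dens  <- density(x)                          # kernel density estimate on a grid
d_hat <- approx(dens$x, dens$y, xout = x)$y  # estimated density at each observation

# Weight each observation by the inverse of its estimated density.
x_sub <- sample(x, 1000, prob = 1 / d_hat)
hist(x_sub)
```

A kernel estimate tends to be less reliable near the boundaries of the data range, so the edges of the resulting histogram deserve a closer look.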


Repeated target distribution nearest-match selection

Generate a set of random deviates from your target (uniform) distribution. For each deviate, find the element from the input (normal) distribution that is closest to it. Consider that element to be selected for the sample.

Repeat the above until the number of unique selections reaches or surpasses the required size of the sample. If it surpasses the required size, truncate it to exactly the required size.


We can use findInterval() to find the closest normal deviate for each uniform deviate. This requires a couple of adjustments to get right. First, we must sort the normal distribution vector, since findInterval() requires vec to be sorted. Second, instead of passing zero (the true minimum of the target distribution) as the minimum to runif(), we must pass the lowest value not lower than zero that exists in the input set; otherwise, a uniform deviate below that value would match an input element below the acceptable minimum of the uniform distribution. Also, for efficiency, before running the loop that calls findInterval(), it is a good idea to remove from the normal distribution vector all values that are outside the target distribution's acceptable range (that is, [0,1]), so they do not participate in the matching algorithm. They are not needed, because they could not be matched anyway.

Provided the target sample size is smaller than the input distribution vector by a sufficient margin, this should eliminate any trace of the input distribution in the resulting sample.

Test data

set.seed(1L);
normSize <- 1e6L; normMean <- 0.5; normSD <- 0.25;
norm <- rnorm(normSize,normMean,normSD);

Solution

unifSize <- 200L; unifMin <- 0; unifMax <- 1;
normVec <- sort(norm[norm>=unifMin & norm<=unifMax]);
inds <- integer();
repeat {
    inds <- unique(c(inds,findInterval(runif(unifSize*2L,normVec[1L],unifMax),normVec)));
    if (length(inds)>=unifSize) break;
};
length(inds) <- unifSize;
unif <- normVec[inds];
hist(unif);

[Image: histogram of unif]

One caveat is that findInterval() doesn't technically find the nearest element; it finds the largest element that is less than or equal to the search value. I don't think this will have any significant impact on the result; at most, it will very slightly bias the selections in favor of smaller values, but in a uniform way. If you really want, you can take a look at the various find-nearest options that exist, e.g. see "R: find nearest index".
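If a true nearest match is wanted, one way to get it is to compare each findInterval() hit with its right-hand neighbour and keep whichever is closer. The helper nearest_index() below is a hypothetical name introduced for this sketch, not an existing R function.

```r
# Sketch: nearest-element lookup built on findInterval().
# vec must be sorted in ascending order.
nearest_index <- function(x, vec) {
  lo <- findInterval(x, vec)          # index of last element <= x (0 if none)
  lo <- pmax(lo, 1L)                  # clamp values below the minimum to index 1
  hi <- pmin(lo + 1L, length(vec))    # right-hand neighbour, clamped at the end
  # Keep the neighbour only when it is strictly closer than the left element.
  ifelse(abs(vec[hi] - x) < abs(x - vec[lo]), hi, lo)
}

nearest_index(c(0.1, 0.26, 0.9), c(0, 0.25, 0.5, 1))
# returns 1 2 4
```

In the loop above, findInterval(...) could then be swapped for nearest_index(...) with the same two arguments.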

You can use the runif() function from the stats package in R in a loop with different seeds. Let's say you want to make 100 subsamples and merge them at the end; then this should do the job:

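# Number_of_observations must be defined beforehand (the number of observations in your data)
list_of_uniformsamples <- vector("list", length = 100)
for (i in 1:100) {
  set.seed(123 + i)   # a different seed for each subsample
  # 1000 random row indices between 1 and Number_of_observations
  list_of_uniformsamples[[i]] <- round(runif(1000, min = 1, max = Number_of_observations))
}
pool_of_uniform_samples <- unlist(list_of_uniformsamples)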
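One detail worth checking in the snippet above: round(runif(...)) maps only a half-width interval to the first and last index, so those two indices come up about half as often as the others. A sketch using sample.int(), which draws every index with equal probability, is shown below; n_obs is a hypothetical stand-in for Number_of_observations.

```r
# Sketch: draw row indices with sample.int() instead of round(runif()).
set.seed(123)
n_obs <- 5000L   # hypothetical number of observations (assumption)

idx_list <- lapply(seq_len(100), function(i) {
  sample.int(n_obs, 1000, replace = TRUE)   # 1000 indices, each equally likely
})
pool_of_indices <- unlist(idx_list)
```

Note that drawing indices with equal probability gives every observation the same chance of selection, so the selected values would be expected to follow the data's own distribution rather than a flat one.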
