
random subsample with uniform distribution in R

I have a large dataset containing observations of vegetation indices (VI). I'm using R to randomly subsample the data so that the relative frequency distribution is uniform (equal numbers of observations across the entire VI range). So far I haven't been able to get a properly even distribution.

Example:

norm<-rnorm(1000, mean = .5, sd = .25) # I have this 

hist(norm) #that is distributed like this

hist(unif<-runif(1000, min=0, max=1)) # but I want to resample the data to look like this

How about this: divide the range of VI into bins of equal width and put the data into these bins. There will be more data in the bins in the middle of the distribution than toward the ends. Choose a bin at random (with equal probability) and then choose one item from that bin.
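The binning idea can be sketched roughly as follows. This is only an illustration under assumptions: `vi` is simulated stand-in data, and the bin count (`n_bins`) and subsample size (`n_draw`) are arbitrary choices that do not come from the question. The result is printed as counts per bin rather than plotted.

```r
set.seed(42L)
vi <- rnorm(10000, mean = 0.5, sd = 0.25)   # stand-in for the VI observations (assumption)
vi <- vi[vi >= 0 & vi <= 1]                 # keep only the range we want to cover

n_bins <- 10L                               # illustrative bin count
n_draw <- 1000L                             # illustrative subsample size
breaks <- seq(0, 1, length.out = n_bins + 1L)

# assign every observation to an equal-width bin, then list the members of each bin
bin_of  <- cut(vi, breaks = breaks, include.lowest = TRUE, labels = FALSE)
members <- split(seq_along(vi), factor(bin_of, levels = seq_len(n_bins)))

# pick a bin uniformly at random, then one observation from that bin
# (this assumes no bin is empty; an empty bin would make sample.int() fail)
picked_bins <- sample.int(n_bins, n_draw, replace = TRUE)
picked_idx  <- vapply(picked_bins, function(b) {
    m <- members[[b]]
    m[sample.int(length(m), 1L)]
}, integer(1))
binned_sample <- vi[picked_idx]

# counts per bin should be roughly equal
print(table(cut(binned_sample, breaks = breaks, include.lowest = TRUE)))
```

Because bins are drawn with replacement, the same observation can be picked more than once, especially from sparsely populated bins.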

One variation on that idea is to choose a point x in the range of VI at random (with equal probability) and then find the data that fall within the interval from (x - dx/2) to (x + dx/2), where dx is big enough to catch at least a few data points. Then choose one datum from that interval (with equal probability). There are probably many more variations.
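A minimal sketch of the window variation, again on simulated stand-in data. The width `dx = 0.05` and the size `n_draw` are assumptions picked for illustration; in practice dx would need tuning so that every window catches some data.

```r
set.seed(42L)
vi <- rnorm(10000, mean = 0.5, sd = 0.25)   # stand-in for the VI observations (assumption)
vi <- vi[vi >= 0 & vi <= 1]

dx     <- 0.05                              # window width (illustrative)
n_draw <- 1000L                             # number of draws (illustrative)
centers <- runif(n_draw, min = 0, max = 1)  # window centres, uniform over the VI range

window_sample <- vapply(centers, function(x) {
    in_window <- which(vi >= x - dx / 2 & vi <= x + dx / 2)
    if (length(in_window) == 0L) return(NA_real_)   # window caught nothing; dx is too small here
    vi[in_window[sample.int(length(in_window), 1L)]]
}, numeric(1))

# drop draws whose window was empty
window_sample <- window_sample[!is.na(window_sample)]
print(table(cut(window_sample, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)))
```

Within each window the choice still follows the local density of the data, so a smaller dx gives a flatter result but risks empty windows.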

One consequence of non-uniform sampling like that is that you might select the same items from the tails over and over. I don't see a way around that; it seems to be an inevitable consequence. But I could be wrong about that.

Sample with inverted input distribution weighting

Aha! I've thought of a second solution, which I think is probably better than my first, which I've kept under the section Repeated target distribution nearest-match selection below.

The sample() function has a prob parameter which allows us to specify probability weights for the elements of the input vector. We can use this parameter to increase the probability of selecting elements that occur in sparser segments of the input distribution (that is, the tails) and decrease the probability of selecting elements that occur in denser segments (that is, the center). I think a simple arithmetic inversion of the density function dnorm() will be sufficient:

Test data

set.seed(1L);
normSize <- 1e4L; normMean <- 0.5; normSD <- 0.25;
norm <- rnorm(normSize,normMean,normSD);

Solution

unifSize <- 1e3L; unifMin <- 0; unifMax <- 1;
normForUnif <- norm[norm>=unifMin & norm<=unifMax];
d <- dnorm(normForUnif,normMean,normSD);
unif <- sample(normForUnif,unifSize,prob=1/d);
hist(unif);

[Figure: histogram of unif]
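The example above relies on knowing that the input is normal with a known mean and standard deviation, which real VI data generally will not be. One possible adaptation, offered only as a sketch and not as part of the original answer, is to estimate the density from the data with density() and interpolate it at each observation with approx(). The variable names and the grid size are assumptions.

```r
set.seed(1L)
vi <- rnorm(10000, mean = 0.5, sd = 0.25)   # stand-in for real VI observations (assumption)
vi_in_range <- vi[vi >= 0 & vi <= 1]

# kernel density estimate of the input, evaluated on a grid over the target range
kde <- density(vi, from = 0, to = 1, n = 512)

# interpolate the estimated density at each observation in range
d_hat <- approx(kde$x, kde$y, xout = vi_in_range)$y

# weight each observation by the inverse of its estimated density
unif_kde <- sample(vi_in_range, 1000L, prob = 1 / d_hat)

# counts per bin should be roughly equal
print(table(cut(unif_kde, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)))
```

If the data stop abruptly at the edges of the range, a kernel estimate tends to underestimate the density there, which would over-weight observations near the edges. Here the estimate is computed on the full simulated vector to sidestep that.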


Repeated target distribution nearest-match selection

Generate a set of random deviates from your target (uniform) distribution. For each deviate, find the element from the input (normal) distribution that is closest to it. Consider that element to be selected for the sample.

Repeat the above until the number of unique selections reaches or surpasses the required size of the sample. If it surpassed the required size, truncate it to exactly the required size.


We can use findInterval() to find the closest normal deviate for each uniform deviate. This requires a couple of adjustments to get right. First, we must sort the normal distribution vector, since findInterval() requires vec to be sorted. Second, instead of passing zero, the true minimum of the target distribution, as the minimum to runif(), we must pass the lowest value not lower than zero that exists in the input set; otherwise, a uniform deviate below that value would have no acceptable element to match (findInterval() returns 0 for it or, if the vector has not been filtered, the index of an element below the acceptable minimum of the uniform distribution). Also, for efficiency, before running the loop that calls findInterval(), it is a good idea to remove from the normal distribution vector all values that are not within the target distribution's acceptable range (that is, [0,1]), so that they do not participate in the matching algorithm. They are not needed, because they could not be matched anyway.
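To make the behaviour of findInterval() concrete, here is a small illustration on a hand-made vector (the values are arbitrary). It shows why a search value below the smallest element is a problem: the returned index is 0, and R silently drops a zero index when subsetting.

```r
vec <- c(0.10, 0.25, 0.40, 0.80)          # must be sorted in non-decreasing order
x   <- c(0.05, 0.10, 0.30, 0.79, 0.95)    # search values

idx <- findInterval(x, vec)
print(idx)        # 0 1 2 3 4: index of the largest element <= each x, 0 if there is none

matched <- vec[idx]
print(matched)    # 0.10 0.25 0.40 0.80: the zero index is dropped, so one match is missing
```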

Provided the target sample size is smaller than the input distribution vector by a sufficient margin, this should eliminate any trace of the input distribution in the resulting sample.

Test data

set.seed(1L);
normSize <- 1e6L; normMean <- 0.5; normSD <- 0.25;
norm <- rnorm(normSize,normMean,normSD);

Solution

unifSize <- 200L; unifMin <- 0; unifMax <- 1;
normVec <- sort(norm[norm>=unifMin & norm<=unifMax]);
inds <- integer();
repeat {
    inds <- unique(c(inds,findInterval(runif(unifSize*2L,normVec[1L],unifMax),normVec)));
    if (length(inds)>=unifSize) break;
};
length(inds) <- unifSize;
unif <- normVec[inds];
hist(unif);

[Figure: histogram of unif]

One caveat is that findInterval() doesn't technically find the nearest element; it finds the largest element that is less than or equal to the search value. I don't think this will have any significant impact on the result; at most, it will infinitesimally bias the selections in favor of smaller values, but in a uniform way. If you really want, you can take a look at the various find-nearest options that exist; for example, see "R: find nearest index".
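If a true nearest match is wanted, one possible approach is to compare the neighbour returned by findInterval() with the next element up and keep whichever is closer. The helper name nearest_index() below is hypothetical, not an existing R function, and the sketch assumes vec is sorted and non-empty.

```r
# hypothetical helper: index of the element of sorted vec nearest to each x
nearest_index <- function(x, vec) {
    lo <- pmax(findInterval(x, vec), 1L)    # neighbour at or below x (clamped to the first index)
    hi <- pmin(lo + 1L, length(vec))        # neighbour above x (clamped to the last index)
    # take the upper neighbour only when it is strictly closer
    ifelse(abs(vec[hi] - x) < abs(x - vec[lo]), hi, lo)
}

vec <- c(0.10, 0.25, 0.40, 0.80)
res <- nearest_index(c(0.05, 0.24, 0.30, 0.70, 0.95), vec)
print(res)   # 1 2 2 4 4
```

In the loop above, this could stand in for the direct findInterval() call, at the cost of a little extra work per batch.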

You can use the runif() function from the stats package in R in a loop with different seeds. Say you want to make 100 subsamples and merge them at the end; then this should do the job:

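# Number_of_observations must be defined first: the number of observations in your dataset
list_of_uniformsamples <- vector("list", length = 100)
for (i in 1:100) {
    set.seed(123 + i)  # a different seed for each subsample
    # draw 1000 observation indices; note that round() makes the two end indices half as likely as the rest
    list_of_uniformsamples[[i]] <- round(runif(1000, min = 1, max = Number_of_observations))
}
# merge all subsamples into one pool of indices
pool_of_uniform_samples <- unlist(list_of_uniformsamples)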


 