I have a large dataset containing observations of Vegetation Indices (VI). I'm using R to randomly subsample the data while keeping the distribution (relative frequency) uniform, i.e. equal numbers of observations across the entire VI range. I haven't been able to get an even distribution.
Example:
norm <- rnorm(1000, mean = 0.5, sd = 0.25) # I have this
hist(norm) # that is distributed like this
hist(unif <- runif(1000, min = 0, max = 1)) # but I want to resample the data to look like this
How about this: divide the range of VI into bins of equal width and put the data into these bins. There will be more data in the bins in the middle of the distribution than toward the ends. Choose a bin at random (with equal probability) and then choose one item from the bin.
One variation on that idea is to choose a point x in the range of VI at random (with equal probability) and then find the data which fall within the interval from (x - dx/2) to (x + dx/2), where dx is large enough to catch at least a few observations. Then choose one datum from that interval (with equal probability). There are probably many more variations.
One consequence of non-uniform sampling like that is that you might select the same items from the tails over and over. I don't see a way around that; it seems to be an inevitable consequence. But I could be wrong about that.
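The equal-width-bin idea above can be sketched as follows. This is a minimal illustration, not code from the answer: the bin count (`nBins`), sample size (`sampleSize`), and test data are assumptions chosen to match the question's example.

```r
# Sketch of the bin-based idea: equal-width bins over [0, 1], pick a bin
# uniformly at random, then pick one observation from that bin.
set.seed(1)
norm <- rnorm(1000, mean = 0.5, sd = 0.25)   # test data as in the question
nBins <- 20                                   # illustrative choice
sampleSize <- 200                             # illustrative choice

bins <- cut(norm, breaks = seq(0, 1, length.out = nBins + 1))
inRange <- !is.na(bins)                       # drop observations outside (0, 1]
vals <- norm[inRange]
bins <- bins[inRange]

# Only bins that actually contain data can be chosen.
counts <- table(bins)
nonEmpty <- names(counts)[counts > 0]

pickOne <- function() {
  b <- sample(nonEmpty, 1)                    # choose a bin uniformly
  candidates <- vals[bins == b]
  # Index explicitly to avoid sample()'s special-case behavior on length-1 vectors.
  candidates[sample.int(length(candidates), 1L)]
}

unifSample <- replicate(sampleSize, pickOne())
hist(unifSample)
```

Because each bin is equally likely regardless of how many observations it holds, sparse tail bins are sampled as often as dense central bins, which flattens the distribution (at the cost of repeatedly reusing tail items, as noted above).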
Aha! I've thought of a second solution, which I think is probably better than my first; I've kept the first under the section Repeated target distribution nearest-match selection below.
The sample() function has a prob parameter which allows us to specify probability weights for the elements of the input vector. We can use this parameter to increase the probability of selecting elements that occur in sparser segments of the input distribution (that is, the tails) and decrease the probability of selecting elements that occur in denser segments (that is, the center). I think a simple arithmetic inversion of the density function dnorm() will be sufficient:
Test data
set.seed(1L);
normSize <- 1e4L; normMean <- 0.5; normSD <- 0.25;
norm <- rnorm(normSize,normMean,normSD);
Solution
unifSize <- 1e3L; unifMin <- 0; unifMax <- 1;
normForUnif <- norm[norm>=unifMin & norm<=unifMax];
d <- dnorm(normForUnif,normMean,normSD);
unif <- sample(normForUnif,unifSize,prob=1/d);
hist(unif);
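As a quick check that the weighted sampling actually flattens the distribution, the sample can be binned and the counts compared. This is a sketch, not part of the original answer; it repeats the answer's setup so the snippet is self-contained, and the chi-squared goodness-of-fit test is an illustrative choice.

```r
# Reproduce the answer's weighted sample.
set.seed(1L)
normSize <- 1e4L; normMean <- 0.5; normSD <- 0.25
norm <- rnorm(normSize, normMean, normSD)

unifSize <- 1e3L; unifMin <- 0; unifMax <- 1
normForUnif <- norm[norm >= unifMin & norm <= unifMax]
d <- dnorm(normForUnif, normMean, normSD)
unif <- sample(normForUnif, unifSize, prob = 1/d)

# Bin into ten equal-width intervals; roughly equal counts indicate uniformity.
counts <- table(cut(unif, breaks = seq(unifMin, unifMax, by = 0.1)))
print(counts)
chisq.test(counts)  # one-way table: goodness-of-fit against equal probabilities
```

A high p-value from chisq.test() suggests no detectable departure from uniformity; a low one would indicate the inversion weights are not fully flattening the sample.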
Generate a set of random deviates from your target (uniform) distribution. For each deviate, find the element from the input (normal) distribution that is closest to it. Consider that element to be selected for the sample.
Repeat the above until the number of unique selections reaches or surpasses the required sample size. If it surpasses the required size, truncate it to exactly the required size.
We can use findInterval() to find the closest normal deviate for each uniform deviate. This requires a couple of accommodations to get right. We must sort the normal distribution vector, since findInterval() requires vec to be sorted. And instead of using zero, the true minimum of the target distribution, as the minimum we pass to runif(), we must pass the lowest value not lower than zero that exists in the input set; otherwise, a uniform deviate below that value would match an input element below the acceptable minimum of the uniform distribution. Also, for efficiency, before running the loop that calls findInterval(), it is a good idea to remove from the normal distribution vector all values that are not within the target distribution's acceptable range (that is, [0,1]), so they will not participate in the matching algorithm. They are not needed, because they could not be matched anyway.
Provided the target sample size is smaller than the input distribution vector by a sufficient margin, this should eliminate any trace of the input distribution in the resulting sample.
Test Data
set.seed(1L);
normSize <- 1e6L; normMean <- 0.5; normSD <- 0.25;
norm <- rnorm(normSize,normMean,normSD);
Solution
unifSize <- 200L; unifMin <- 0; unifMax <- 1;
normVec <- sort(norm[norm>=unifMin & norm<=unifMax]);
inds <- integer();
repeat {
    inds <- unique(c(inds, findInterval(runif(unifSize*2L, normVec[1L], unifMax), normVec)));
    if (length(inds) >= unifSize) break;
};
length(inds) <- unifSize;
unif <- normVec[inds];
hist(unif);
One caveat is that findInterval() doesn't technically find the nearest element; it finds the element that is less than or equal to the search value. I don't think this will have any significant impact on the result; at most, it will infinitesimally bias the selections in favor of smaller values, but in a uniform way. If you really want, you can take a look at the various find-nearest options that exist, e.g. see R: find nearest index.
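If the less-than-or-equal bias matters, one possible adjustment (a sketch, not part of the answer above; `nearestIndex` is a hypothetical helper name) is to compare each findInterval() hit with the element just above it and keep whichever is closer:

```r
# Find the index of the truly nearest element of a sorted vector for each x,
# rather than the element at or below x as findInterval() returns.
nearestIndex <- function(x, sortedVec) {
  i <- findInterval(x, sortedVec)
  i <- pmax(i, 1L)                       # guard: findInterval() returns 0 below the minimum
  j <- pmin(i + 1L, length(sortedVec))   # candidate element just above x
  # Keep whichever neighbor is closer to x.
  ifelse(abs(x - sortedVec[i]) <= abs(sortedVec[j] - x), i, j)
}

set.seed(1L)
sortedVec <- sort(rnorm(100))
x <- runif(5, min(sortedVec), max(sortedVec))
idx <- nearestIndex(x, sortedVec)
```

This keeps the same O(log n) lookup as plain findInterval(), adding only a vectorized comparison of the two neighboring candidates.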
You can use the runif() function from the stats package in R in a loop with different seeds. Let's say you want to make 100 subsamples and merge them at the end; then this should do the job:
list_of_uniformsamples <- vector("list", length = 100)
for (i in 1:100) {
  set.seed(123 + i)
  list_of_uniformsamples[[i]] <- round(runif(1000, min = 1, max = Number_of_observations))
}
pool_of_uniform_samples <- unlist(list_of_uniformsamples)
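Note that these draws are row indices, not VI values, and repeated indices are possible across the 100 subsamples. As a usage sketch (assuming your observations live in a data frame; `vi_data` and the value of `Number_of_observations` here are hypothetical stand-ins), the pool can then subset the data after de-duplication:

```r
# Hypothetical setup standing in for the real VI dataset.
Number_of_observations <- 5000
vi_data <- data.frame(VI = rnorm(Number_of_observations, 0.5, 0.25))

# The answer's loop, producing 100 vectors of uniformly drawn row indices.
list_of_uniformsamples <- vector("list", length = 100)
for (i in 1:100) {
  set.seed(123 + i)
  list_of_uniformsamples[[i]] <- round(runif(1000, min = 1, max = Number_of_observations))
}

# Drop repeated indices before subsetting, so no row is selected twice.
pool_of_uniform_samples <- unique(unlist(list_of_uniformsamples))
subsample <- vi_data[pool_of_uniform_samples, , drop = FALSE]
```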