
Random sampling from data quantiles, while preserving original probability distribution

Following my previous question, "Random sampling from a dataset, while preserving original probability distribution", I want to sample from a set of >2000 numbers gathered from measurement. I want to perform several tests (taking at most 10 samples in each test) while preserving the probability distribution over the whole testing process and, as far as possible, within each test. Now, instead of completely random sampling, I partition the data into 5 quantiles, and in each of 10 tests I sample 2 data elements from each quantile, choosing uniformly at random from the array of data in that quantile.

The problem with completely random sampling was that, because the distribution of the data is long-tailed, I was getting almost the same values in each test. I want some small-value samples, some middle-value samples, and some large-value samples in each test, so I sampled as described.

Fig 1. Density plot of the ~2k data elements.

This is the R code for calculating the quantiles:

q=quantile(data, probs = seq(0, 1, by= 0.1))

I then partition the data into 5 quantiles (each one stored as an array) and sample from each partition. For example, I do this in Java:

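import java.util.Random;

public int getRandomData(int quantile) {
    // Placeholder values: one row per quantile partition
    int[][] data = {
        {1, 2, 3, 4, 5},
        {6, 7, 8, 9, 10},
        {11, 12, 13, 14, 15},
        {16, 17, 18, 19, 20},
        {21, 22, 23, 24, 25}
    };
    int length = data[quantile].length;
    Random r = new Random();
    // Uniformly random index in [0, length)
    int randomInt = r.nextInt(length);
    return data[quantile][randomInt];
}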
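The partition-and-sample procedure described in the question can be sketched end to end roughly as follows. This is only an illustration under assumptions: the class `QuantileSampler` and its methods `partition` and `sampleOneTest` are hypothetical names, the data is held as `double[]`, the partitions are formed by sorting and cutting into equal-count groups rather than by calling R's `quantile`, and values are drawn with replacement.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper class, not part of the original post.
class QuantileSampler {

    /**
     * Sorts a copy of the data and cuts it into `groups` contiguous partitions
     * of (nearly) equal size. Assumes data.length >= groups so that no
     * partition is empty.
     */
    static double[][] partition(double[] data, int groups) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        double[][] parts = new double[groups][];
        for (int g = 0; g < groups; g++) {
            int from = (int) ((long) g * sorted.length / groups);
            int to = (int) ((long) (g + 1) * sorted.length / groups);
            parts[g] = Arrays.copyOfRange(sorted, from, to);
        }
        return parts;
    }

    /**
     * Builds one test sample: `perGroup` values drawn uniformly at random
     * (with replacement) from every partition, lowest partition first.
     */
    static double[] sampleOneTest(double[][] parts, int perGroup, Random rng) {
        double[] sample = new double[parts.length * perGroup];
        int i = 0;
        for (double[] part : parts) {
            for (int k = 0; k < perGroup; k++) {
                sample[i++] = part[rng.nextInt(part.length)];
            }
        }
        return sample;
    }
}
```

With 5 partitions and 2 draws per partition, each call to `sampleOneTest` returns the 10 values for one test; calling it 10 times with the same `Random` instance would give the 10 tests.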

So, do the samples for each test, and for all tests overall, preserve the characteristics of the original distribution, for example mean and variance? If not, how should I arrange the sampling to achieve this goal?

preserve the characteristics of the original distribution, for example mean and variance?

The samples will have a similar distribution. You might want to add a check that a sample meets your requirement, and perhaps draw again if it does not, but this will get you close.
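One possible shape for such a check is sketched below: compare the sample's mean and variance with those of the full data set and redraw when either is off by more than a tolerance. Everything here is an assumption made for illustration: the class `DistributionCheck`, its method names, the use of a relative tolerance `relTol`, and the retry limit `maxTries` do not come from the original answer. Note also that rejecting and redrawing samples slightly changes how the accepted samples are distributed.

```java
import java.util.function.Supplier;

// Hypothetical helper class, not part of the original answer.
class DistributionCheck {

    /** Arithmetic mean; assumes xs is non-empty. */
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    /** Unbiased sample variance (divides by n - 1); assumes xs has at least 2 elements. */
    static double variance(double[] xs) {
        double m = mean(xs);
        double sumSq = 0.0;
        for (double x : xs) sumSq += (x - m) * (x - m);
        return sumSq / (xs.length - 1);
    }

    /** True when the sample's mean and variance are each within relTol (relative) of the reference's. */
    static boolean isClose(double[] sample, double[] reference, double relTol) {
        double m = mean(reference);
        double v = variance(reference);
        return Math.abs(mean(sample) - m) <= relTol * Math.abs(m)
            && Math.abs(variance(sample) - v) <= relTol * v;
    }

    /** Draws from `sampler` until a sample passes the check; returns null after maxTries failures. */
    static double[] drawUntilClose(Supplier<double[]> sampler, double[] reference,
                                   double relTol, int maxTries) {
        for (int t = 0; t < maxTries; t++) {
            double[] candidate = sampler.get();
            if (isClose(candidate, reference, relTol)) return candidate;
        }
        return null;
    }
}
```

The `sampler` argument can be any routine that produces one test's worth of values, for example a lambda wrapping the per-quantile draw. With only 10 values per test and long-tailed data, a tight tolerance may rarely be met, so `relTol` and `maxTries` would need tuning against the real data.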

If not, how to arrange sampling to achieve this goal?

Unless every value in the data is duplicated (i.e. everything appears twice), you need to include exactly one of every sample value. This is the only way to get exactly the same distribution.
