How to sample data based off the distribution of another dataset in R

I would like to sample a large dataset based on the distribution of a smaller dataset in R. I have been searching for a solution for some time without success. I am relatively new to R, so I apologize if this is straightforward. However, I have tried some solutions.

Here are some sample data. I'll call them observed and model:

# Set seed
set.seed(2)

# Create smaller observed data
Obs <- rnorm(1000, 5, 2.5)

# Create larger modeled data
set.seed(2)
Model <- rnorm(10000, 8, 1.5)

The distributions of the two datasets are as follows: [image]
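The original image is not preserved. A minimal sketch of how such a comparison could be drawn, assuming overlaid kernel density curves (the plot type, styling, and the names d.obs and d.mod are assumptions, not taken from the post):

```r
# Recreate the sample data from the question
set.seed(2)
Obs <- rnorm(1000, 5, 2.5)
set.seed(2)
Model <- rnorm(10000, 8, 1.5)

# Kernel density estimates for both datasets (default grid of 512 points)
d.obs <- density(Obs)
d.mod <- density(Model)

# Overlay the two curves on shared axes
plot(d.mod,
     xlim = range(d.obs$x, d.mod$x),
     ylim = c(0, max(d.obs$y, d.mod$y)),
     main = "Observed vs. Model", xlab = "Value")
lines(d.obs, lty = 2)
legend("topleft", legend = c("Model", "Observed"), lty = c(1, 2))
```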

Goal: I would like to sample the larger "model" dataset so that it matches the smaller "observed" dataset. I understand that different data points are involved, so it won't be a direct match.

I have been reading up on density() and sample(), and tried the following:

# Obtain the density of the observed at the length of the model.
# Note: info on the sample() function stated the prob argument in the sample() function 
# must be the same length as what's being sampled. Thus, n=length(Model) below.

dens.obs <- density(Obs, n=length(Model))

# Sample the Model data the length(Obs) at the probability of density of the observed
set.seed(22)
SampleMod <- sample(Model, length(Obs), replace=FALSE, prob=dens.obs$y)

This gives me a new plot that looks very similar to the old one (except for the tails): [image]

I was hoping for a better match. Therefore I started exploring the density function on the model data. See below:

# Density function on model, length of model
dens.mod <- density(Model, n=length(Model))

# Sample the x values of the model density (dens.mod$x) weighted by the observed density (dens.obs$y)
set.seed(22)
SampleMod3 <- sample(dens.mod$x, length(Obs), replace=FALSE, prob=dens.obs$y)

Here are two plots. The first is the same as the first sample above, and the second is the second sample: [image]

There is a more desirable shift in the right plot, which shows the density of the model sampled with the density of the observed as weights. However, the data are not the same. That is, I did NOT sample the modeled data. See below:

summary(SampleMod3 %in% Model)

produces:

   Mode   FALSE    NA's 
logical    1000       0 

This indicates that I did not sample the modeled data, but rather the grid of points at which the density of the modeled data was evaluated. Is it possible to sample a dataset based on the distribution of another dataset? Thank you in advance.

EDIT:

Thanks for all the help, guys! Here is my plot using the approxfun() approach suggested by danielson and supported by bethanyp.

[image]

Any help understanding why the new distribution looks so funky?

Interesting question. I think this will help. First, it approximates the observed density with a function. Then, it samples from the Model points, using the fitted density evaluated at each Model point as the sampling probabilities.

predict_density = approxfun(dens.obs) #function that approximates dens.obs
#sample points from Model with probability distr. of dens.obs
SampleMod3 <- sample(Model, length(Obs), replace=FALSE, prob=predict_density(Model))
summary(SampleMod3 %in% Model)
   Mode    TRUE    NA's 
logical    1000       0 
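Two details of the snippet above may be worth checking. By default approxfun() returns NA outside the range of dens.obs$x, and sample() refuses NA probabilities. Also, weighting by the observed density alone tends to give a sample shaped roughly like the product of the two densities rather than like Obs, which may be one reason for an odd-looking result. A minimal sketch, assuming kernel density estimates are adequate for both datasets, weights each Model point by the ratio of observed to model density instead (the names f.obs, f.mod, w and SampleMod4 are illustrative, not from the original answer):

```r
# Recreate the sample data from the question
set.seed(2)
Obs <- rnorm(1000, 5, 2.5)
set.seed(2)
Model <- rnorm(10000, 8, 1.5)

# Turn each density estimate into a function.
# yleft/yright = 0 returns 0 instead of NA outside the estimated range.
f.obs <- approxfun(density(Obs), yleft = 0, yright = 0)
f.mod <- approxfun(density(Model), yleft = 0, yright = 0)

# Weight = how likely a point is under Obs relative to how common it is in Model
w <- f.obs(Model) / f.mod(Model)
w[!is.finite(w)] <- 0  # guard against division by zero

# Draw actual Model values using those weights
set.seed(22)
SampleMod4 <- sample(Model, length(Obs), replace = FALSE, prob = w)

summary(SampleMod4 %in% Model)
```

The match can only be as good as the overlap allows: where Model has very few points (here, the low end of the Obs range), sampling without replacement runs out of candidates. Using replace = TRUE would let those points be reused, at the cost of duplicated values.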

I assume that in practice you are using a real set of non-randomly generated data. In that case, different values come up with different probabilities, because sampling at random does not mean there is no pattern in the data. In the wild, real things have real frequencies, and these will show up in your meta-sample.

So you should use weighted probabilities when selecting your smaller sub-sample from the original.

For example, take the whole population {1,2,1,3,4,1,3}, where the probabilities of each number being drawn (remember they must sum to 1) are: 1: 3/7 ≈ .4286, 2: 1/7 ≈ .1429, 3: 2/7 ≈ .2857, 4: 1/7 ≈ .1429

If you use these weighted probabilities as the prob = my_freqs argument of

sample(x, size, replace = FALSE, prob = my_freqs)

you will likely obtain a sample more in line with what you were expecting. But I am not 100% sure if this is what you are after.
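As a rough illustration of the example above, the frequencies can be computed with table() and passed to sample(). This is only a sketch: the names pop, vals and draws are made up here, and replace = TRUE is an assumption, needed because there are just four distinct values to draw from.

```r
# The example population
pop <- c(1, 2, 1, 3, 4, 1, 3)

# Relative frequency of each distinct value (3/7, 1/7, 2/7, 1/7)
freqs <- prop.table(table(pop))
vals <- as.numeric(names(freqs))
my_freqs <- as.numeric(freqs)

# Draw 1000 values, each chosen with its observed frequency
set.seed(1)
draws <- sample(vals, size = 1000, replace = TRUE, prob = my_freqs)

# Proportions in the draws should be close to my_freqs
round(prop.table(table(draws)), 2)
```

sample() rescales the prob vector internally, so raw counts would work as weights too.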

With the random data, try set.seed(2) and see whether telling R to use the same seed that generated the original sets gets you closer to your goal.

I know that there is a random formula associated with each set. I would have to assume it amounts to a set of frequency probabilities tied to the method used to generate the random values, so it might help you to use that prior to sampling from the random sets.
