
Iterative or Lazy Reservoir Sampling

I'm fairly well acquainted with using Reservoir Sampling to sample from a set of undetermined length in a single pass over the data. One limitation of this approach, in my mind, is that it still requires a pass over the entire data set before any results can be returned. Conceptually this makes sense, since items from the entirety of the sequence must have the opportunity to replace previously encountered items to achieve a uniform sample.

Is there a way to yield some random results before the entire sequence has been evaluated? I'm thinking of the kind of lazy approach that would fit well with Python's great itertools library. Perhaps this could be done within some given error tolerance? I'd appreciate any sort of feedback on this idea!

Just to clarify the question a bit, this diagram sums up my understanding of the in-memory vs. streaming tradeoffs of different sampling techniques. What I want is something that falls into the category of Stream Sampling, where we don't know the length of the population beforehand.

[diagram: in-memory vs. stream sampling techniques]

Clearly there is a seeming contradiction between not knowing the length a priori and still getting a uniform sample, since we will most likely bias the sample towards the beginning of the population. Is there a way to quantify this bias? Are there tradeoffs to be made? Does anybody have a clever algorithm to solve this problem?

If you know in advance the total number of items that will be yielded by an iterable population, it is possible to yield the items of a sample of population as you come to them (not only after reaching the end). If you don't know the population size ahead of time, this is impossible (as the probability of any item being in the sample can't be calculated).

Here's a quick generator that does this:

import random

def sample_given_size(population, population_size, sample_size):
    # Selection sampling: keep each item with probability
    # (picks still needed) / (items still remaining).
    for item in population:
        if sample_size == 0:
            return  # sample complete; stop consuming the stream early
        if random.random() < sample_size / population_size:
            yield item
            sample_size -= 1
        population_size -= 1

Note that the generator yields items in the order they appear in the population (not in random order, like random.sample or most reservoir sampling code), so a slice of the sample will not be a random subsample!
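A small self-contained sketch of this ordering caveat (the stream of integers is a made-up stand-in for a one-pass data source, and the generator is repeated here so the snippet runs on its own):

```python
import random

def sample_given_size(population, population_size, sample_size):
    # Keep each item with probability (picks left / items left).
    for item in population:
        if random.random() < sample_size / population_size:
            yield item
            sample_size -= 1
        population_size -= 1

stream = iter(range(100))               # stand-in for a one-pass stream
sample = list(sample_given_size(stream, 100, 10))
in_order = sample == sorted(sample)     # items come out in stream order
random.shuffle(sample)                  # shuffle afterwards if you need a random order
print(len(sample), in_order)            # prints "10 True"
```

The selection probability reaches 1 whenever the remaining picks equal the remaining items, so the generator always yields exactly sample_size items.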

If the population size is known beforehand, can't you just generate sample_size random "indices" (positions in the stream) and use those to do a lazy yield? You won't have to read the entire stream.

For instance, if population_size were 100 and sample_size were 3, you would generate a random set of 3 distinct integers from 1 to 100; say you get 10, 67 and 72.

Now you yield the 10th, 67th and 72nd elements of the stream and ignore the rest.
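That idea can be sketched as follows (lazy_sample_by_index is a made-up helper name, and the seed is only there to make the example reproducible):

```python
import random

def lazy_sample_by_index(stream, population_size, sample_size):
    # Pre-pick which positions to keep, then walk the stream lazily,
    # yielding only those positions and stopping after the last one.
    chosen = iter(sorted(random.sample(range(population_size), sample_size)))
    target = next(chosen, None)
    for index, item in enumerate(stream):
        if index == target:
            yield item
            target = next(chosen, None)
            if target is None:
                return  # all picks made; no need to read the rest

random.seed(0)
picks = list(lazy_sample_by_index(iter(range(1000)), 1000, 3))
print(picks)
```

Because the chosen positions are sorted up front, the stream is consumed only as far as the largest picked index, which is what makes the yield lazy.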

I guess I don't understand the question.
