
Is a sample size of 1 considered reservoir sampling?

I just want to know whether my code is reservoir sampling. I have a stream of pageviews that I want to process, and I process one pageview at a time. However, since most of the pageviews are the same, I just want to randomly pick a pageview to process (one at a time). For example, I have pageviews such as

[www.example.com, www.example.com, www.example1.com, www.example3.com, ...]

I'm processing one element at a time. Here's my code.

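import random

# The original snippet omitted the enclosing class; the name PageviewSampler is a placeholder.
class PageviewSampler:
  def __init__(self):
    self.counter = 0  # number of pageviews seen so far

  def processable(self):
    # Returns True with probability 1/n for the n-th pageview seen.
    self.counter += 1
    return random.random() < 1.0 / self.counter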

Following the algorithm for reservoir sampling (described here: https://en.wikipedia.org/wiki/Reservoir_sampling ), where we store just one pageview (reservoir size = 1), the following implementation shows how the strategy of probabilistic selection from the streaming pageviews leads to uniform selection probabilities:

import numpy as np
import matplotlib.pyplot as plt
max_num = 10 # maximum number of pageviews we want to consider
# replicate the experiment ntrials times and find the probability for selection of any pageview
pageview_indices = []
ntrials = 10000
for _ in range(ntrials):
    pageview_index = None # index of the single pageview to be kept
    i = 0
    while True: # streaming pageviews
        i += 1 # next pageview
        if i > max_num:
            break
        # keep first pageview and from next pageview onwards discard the old one kept with probability 1 - 1/i
        pageview_index = 1 if i == 1 else np.random.choice([pageview_index, i], 1, p=[1-1./i, 1./i])[0]
        #print('pageview chosen:', pageview_index)
    print('Final pageview chosen:', pageview_index)
    pageview_indices.append(pageview_index)
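# 'normed' was removed from matplotlib; 'density=True' replaces it. Unit-width bins centred on each index make bar height equal probability.
plt.hist(pageview_indices, bins=np.arange(0.5, max_num + 1.5), density=True, facecolor='green', alpha=0.75)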
plt.xlabel('Pageview Index')
plt.ylabel('Probability Chosen')
plt.title('Reservoir Sampling')
plt.axis([0, max_num+1, 0, 0.15])
plt.xticks(range(1, max_num+1))
plt.grid(True)
plt.show()  # needed when run as a script rather than in a notebook
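The same selection rule can also be written as a small function over any iterable, which may make the link to the question's `processable()` check easier to see. This is only a minimal sketch: the function name `reservoir_sample_one` and the `rng` parameter are assumptions made for illustration and are not part of the original post.

```python
import random
from collections import Counter


def reservoir_sample_one(stream, rng=random):
    """Return one item chosen uniformly at random from an iterable of unknown length.

    Returns None if the stream is empty.
    """
    chosen = None
    for count, item in enumerate(stream, start=1):
        # Replace the kept item with probability 1/count.
        # For count == 1 this is always true, because rng.random() is in [0, 1).
        if rng.random() < 1.0 / count:
            chosen = item
    return chosen


if __name__ == "__main__":
    pageviews = ["www.example.com", "www.example.com",
                 "www.example1.com", "www.example3.com"]
    rng = random.Random(0)  # seeded only to make the demo repeatable
    counts = Counter(reservoir_sample_one(pageviews, rng) for _ in range(10000))
    print(counts)
```

Note that the item is only known to be a uniform sample once the stream has ended, so the kept pageview should be processed after the loop, not each time the check succeeds. The sample is also uniform over positions in the stream, not over distinct URLs: in the demo list, `www.example.com` appears twice and is therefore returned about half the time.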

[Figure: histogram titled "Reservoir Sampling", x-axis "Pageview Index", y-axis "Probability Chosen"]

As can be seen from above, the probability of each pageview index being chosen is almost uniform (1/10 for each of the 10 pageviews); it can also be proved mathematically to be exactly uniform.
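A short sketch of that proof, using notation introduced here and not in the original post ($n$ is the total number of pageviews, $k$ is the index of a particular pageview): pageview $k$ is the one finally kept exactly when it is selected at step $k$ (probability $1/k$) and is not replaced at any later step $j$ (probability $1 - 1/j$ each). The product telescopes:

```latex
\begin{aligned}
P(\text{pageview } k \text{ is kept after } n \text{ pageviews})
  &= \frac{1}{k} \prod_{j=k+1}^{n} \left(1 - \frac{1}{j}\right) \\
  &= \frac{1}{k} \cdot \frac{k}{k+1} \cdot \frac{k+1}{k+2} \cdots \frac{n-1}{n} \\
  &= \frac{1}{n}.
\end{aligned}
```

For $n = 10$ this gives $1/10$ for every index, which matches the simulated histogram.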
