Infinite/lazy reservoir sampling in Haskell

Question

I tried to implement a simple reservoir sampling in Haskell following http://jeremykun.com/2013/07/05/reservoir-sampling/ (note that the algorithm shown there is possibly semantically incorrect).

According to the question "Iterative or Lazy Reservoir Sampling", lazy reservoir sampling is impossible unless you know the population size ahead of time.

Even so, I don't understand why (operationally speaking) the sampleReservoir below doesn't work on infinite lists. Where exactly is laziness broken?

import System.Random (randomRIO)

-- equivalent to python's enumerate
enumerate :: (Num i, Enum i) => i -> [e] -> [(i, e)]
enumerate start = zip [start..]

sampleReservoir stream = 
    foldr 
        (\(i, e) reservoir -> do 
            r <- randomRIO (0.0, 1.0) :: IO Double
                -- the annotation is needed: the literals alone leave randomRIO's type ambiguous
            if r < (1.0 / fromIntegral i) then
                fmap (e:) reservoir
            else 
                reservoir) 
        (return []) 
        (enumerate 1 stream)

The challenge and test is fmap (take 1) $ sampleReservoir [1..].

Furthermore, if reservoir sampling can't be lazy, what can take in a lazy list and produce a sampled lazy list?

I get the idea that there must be a way of making the above function lazy in its output as well, because I could change this:

if r < (1.0 / fromIntegral i) then
    fmap (e:) reservoir
else

To:

if r < (1.0 / fromIntegral i) then
    do 
        print e
        fmap (e:) reservoir

This prints results while the function is iterating over the list. Using a coroutine abstraction, perhaps instead of print e there could be a yield e, with the rest of the computation held as a continuation.

Answer 1

The problem is that the IO monad maintains a strict sequence between actions. Writing fmap (e:) reservoir will first execute all of the effects associated with reservoir, and there are infinitely many of them if the input list is infinite.
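The strict sequencing can be made visible on a finite list. The following is a minimal sketch, not the question's code: keepAll and effectsForFirst are hypothetical names, and a counter bump on an IORef stands in for the call to randomRIO. It assumes only the base library.

```haskell
import Data.IORef (IORef, newIORef, modifyIORef', readIORef)

-- Same shape as the question's fold; the counter bump stands in for randomRIO.
keepAll :: IORef Int -> [a] -> IO [a]
keepAll counter =
    foldr
        (\e rest -> do
            modifyIORef' counter (+ 1)   -- the per-element effect
            fmap (e :) rest)             -- runs every effect in `rest` before returning
        (return [])

-- Returns how many effects ran although only the first element was requested.
effectsForFirst :: Int -> IO Int
effectsForFirst n = do
    counter <- newIORef 0
    _ <- fmap (take 1) (keepAll counter [1 .. n])
    readIORef counter
```

Under these assumptions, effectsForFirst 1000 reports 1000 rather than 1: take 1 is applied only after the whole action has finished, so an infinite input never gets that far.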

I was able to fix this with liberal use of unsafeInterleaveIO, which allows you to break the semantics of IO:

import System.IO.Unsafe (unsafeInterleaveIO)

sampleReservoir2 :: [e] -> IO [e]
sampleReservoir2 stream = 
    foldr 
        (\(i, e) reservoir -> do 
            r <- unsafeInterleaveIO $ randomRIO (0.0, 1.0) :: IO Double -- randomRIO gets confused about 0.0 and 1.0
            if r < (1.0 / fromIntegral i) then unsafeInterleaveIO $ do
                rr <- reservoir
                return (e:rr)
            else 
                reservoir) 
        (return []) 
        (enumerate 1 stream)

Obviously, this allows the IO actions to be interleaved, but since all you're doing is generating random numbers it shouldn't matter. However, this solution isn't very satisfactory; the correct solution is to refactor your code somewhat. You should generate an infinite list of random numbers, then consume that infinite list (lazily) with foldr:

import Control.Monad.Random (MonadRandom, getRandomRs)

sampleReservoir3 :: MonadRandom m => [a] -> m [a]
sampleReservoir3 stream = do
  ws <- getRandomRs (0, 1 :: Double) 
  return $ foldr 
     (\(w, (i, e)) reservoir -> 
        (if w < (1 / fromIntegral i) then (e:) else id) reservoir
     ) 
     []
     (zip ws $ enumerate 1 stream)

This can also (equivalently) be written as

import System.Random (newStdGen, randomRs)

sampleReservoir4 :: [a] -> IO [a] 
sampleReservoir4 stream = do
  seed <- newStdGen 
  let ws = randomRs (0, 1 :: Double) seed 
  return $ foldr 
     (\(w, (i, e)) reservoir -> 
        (if w < (1 / fromIntegral i) then (e:) else id) reservoir
     ) 
     []
     (zip ws $ enumerate 1 stream)
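To see why this version is lazy, it may help to isolate the pure fold. The sketch below is an illustration only: sampleWith and demo are hypothetical names, and the draws are passed in as an ordinary list, with fixed values standing in for real random numbers so the result is predictable.

```haskell
-- Pure core of the fold: `ws` stands in for the stream of uniform draws in [0, 1].
sampleWith :: [Double] -> [a] -> [a]
sampleWith ws stream =
    foldr
        (\(w, (i, e)) rest -> if w < 1 / fromIntegral i then e : rest else rest)
        []
        (zip ws (zip [1 :: Int ..] stream))

-- Hypothetical fixed draws, chosen only to make the outcome deterministic.
demo :: [Int]
demo = take 4 (sampleWith (cycle [0.9, 0.0]) [1 ..])
-- demo == [1, 2, 4, 6]
```

Because no effects are sequenced, e : rest is available before rest is evaluated, so take can stop the traversal of the infinite input as soon as it has enough elements.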

As an aside, I'm not sure about the correctness of the algorithm, since it seems to always return the first element of the input list first (for i = 1 the threshold is 1, so the test r < 1 almost always succeeds). Not very random.
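For comparison, here is a sketch of a different technique from the fold above: the standard size-1 reservoir rule (the i-th element replaces the current sample with probability 1/i), written so that it lazily emits the reservoir after each input element. The name runningSample is hypothetical, and the draws are again passed in as a plain list, assumed to be uniform in [0, 1].

```haskell
-- Each output element is the size-1 reservoir after one more input element.
runningSample :: [Double] -> [a] -> [a]
runningSample _ [] = []
runningSample ws (x : xs) =
    scanl step x (zip (drop 1 ws) (zip [2 :: Int ..] xs))
  where
    step current (w, (i, e))
        | w < 1 / fromIntegral i = e         -- replace with probability 1/i
        | otherwise              = current   -- keep the current sample

-- With the hypothetical fixed draws cycle [0.9, 0.0]:
-- take 6 (runningSample (cycle [0.9, 0.0]) [1 :: Int ..]) == [1, 2, 2, 4, 4, 6]
```

The n-th output is a sample of the first n inputs, so a consumer of an infinite stream can stop wherever it likes; no single final answer exists without knowing where the input ends.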

Answer 1
4, accepted, 2015-11-25 16:30:04
