Infinite/Lazy Reservoir Sampling in Haskell

I tried to implement a simple reservoir sampling in Haskell, following http://jeremykun.com/2013/07/05/reservoir-sampling/ (note that the algorithm shown there is possibly semantically incorrect).

According to the question "Iterative or Lazy Reservoir Sampling", lazy reservoir sampling is impossible unless you know the population size ahead of time.
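To see why the population size matters, here is a minimal sketch of classic single-item reservoir sampling written as a strict left fold. It is not the code from the linked posts: the name reservoirOne is hypothetical, and the random weights are passed in as a plain list (an assumption made so the example is deterministic and needs only base).

```haskell
import Data.List (foldl')

-- Hypothetical helper: single-slot reservoir sampling.
-- The i-th element (1-based) replaces the current choice when its
-- weight is below 1/i. Weights are supplied by the caller.
reservoirOne :: [Double] -> [a] -> Maybe a
reservoirOne ws stream = foldl' step Nothing (zip3 [1 :: Int ..] ws stream)
  where
    step current (i, w, e)
      | w < 1 / fromIntegral i = Just e   -- element i takes over the slot
      | otherwise              = current  -- keep the earlier choice
```

With the weights [0.5, 0.4, 0.9, 0.2] and the input "abcd", the slot holds 'b' after three elements but ends as 'd', because the fourth weight 0.2 is below 1/4. The choice can be overturned by any later element, so nothing can be emitted until the input ends.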

Even so, I don't understand why (operationally speaking) the sampleReservoir below doesn't work on infinite lists. Where exactly is laziness broken?

import System.Random (randomRIO)

-- equivalent to python's enumerate
enumerate :: (Num i, Enum i) => i -> [e] -> [(i, e)]
enumerate start = zip [start..]

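sampleReservoir :: [e] -> IO [e]
sampleReservoir stream = 
    foldr 
        (\(i, e) reservoir -> do 
            -- the Double annotation is needed: without it the type of
            -- the literals 0.0 and 1.0 passed to randomRIO is ambiguous
            r <- randomRIO (0.0, 1.0) :: IO Double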
            if r < (1.0 / fromIntegral i) then
                fmap (e:) reservoir
            else 
                reservoir) 
        (return []) 
        (enumerate 1 stream)

The challenge and test is fmap (take 1) $ sampleReservoir [1..].

Furthermore, if reservoir sampling can't be lazy, what can take in a lazy list and produce a sampled lazy list?

I have the idea that there must be a way of making the above function lazy in its output as well, because I could change this:

if r < (1.0 / fromIntegral i) then
    fmap (e:) reservoir
else 
    reservoir

To:

if r < (1.0 / fromIntegral i) then
    do 
        print e
        fmap (e:) reservoir

This shows results as the function is iterating over the list. Using a coroutine abstraction, perhaps there could be a yield e instead of print e, and the rest of the computation could be held as a continuation.

The problem is that the IO monad maintains a strict sequence between actions. Writing fmap (e:) reservoir will first execute all of the effects associated with reservoir, which will be infinitely many if the input list is infinite.
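The sequencing can be observed directly. The following is a small sketch, not part of the original code: strictBuild, pureBuild and effectsBeforeReturn are hypothetical names, and an IORef counter stands in for the random draws so that the number of executed effects can be counted.

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- Rebuild a list inside IO, bumping a counter once per element.
-- Same shape as the fold in the question: fmap (x :) rest has to run
-- every effect in rest before it can return the list.
strictBuild :: IORef Int -> [a] -> IO [a]
strictBuild counter = foldr step (return [])
  where
    step x rest = do
      modifyIORef' counter (+ 1)
      fmap (x :) rest

-- The pure analogue: the tail is a thunk that is forced only on demand.
pureBuild :: [a] -> [a]
pureBuild = foldr (:) []

-- Count the effects that have run by the time strictBuild returns,
-- even though the resulting list is never inspected.
effectsBeforeReturn :: Int -> IO Int
effectsBeforeReturn n = do
  counter <- newIORef 0
  _ <- strictBuild counter [1 .. n]
  readIORef counter
```

effectsBeforeReturn 1000 reports 1000 although no element of the result is ever looked at, whereas take 1 (pureBuild [1 ..]) returns [1] on an infinite input. On an infinite list the IO version never gets to return at all.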

I was able to fix this with liberal use of unsafeInterleaveIO, which allows you to break the semantics of IO:

import System.IO.Unsafe (unsafeInterleaveIO)

sampleReservoir2 :: [e] -> IO [e]
sampleReservoir2 stream = 
    foldr 
        (\(i, e) reservoir -> do 
            -- the Double annotation resolves the otherwise ambiguous literals
            r <- unsafeInterleaveIO $ randomRIO (0.0, 1.0) :: IO Double
            if r < (1.0 / fromIntegral i) then unsafeInterleaveIO $ do
                rr <- reservoir
                return (e:rr)
            else 
                reservoir) 
        (return []) 
        (enumerate 1 stream)
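What unsafeInterleaveIO buys here can be checked with a counter in place of the random draws. This is only an illustrative sketch under that assumption; deferredBuild and effectsAfterTake are hypothetical names, not part of the answer's code.

```haskell
import Control.Exception (evaluate)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import System.IO.Unsafe (unsafeInterleaveIO)

-- Rebuild a list inside IO, bumping a counter once per element, but
-- defer the effects of the tail until the tail is actually demanded.
deferredBuild :: IORef Int -> [a] -> IO [a]
deferredBuild counter = foldr step (return [])
  where
    step x rest = do
      modifyIORef' counter (+ 1)
      rest' <- unsafeInterleaveIO rest  -- returns immediately with a thunk
      return (x : rest')

-- Force the first k cells of the result built from an infinite input,
-- then report how many effects have run so far.
effectsAfterTake :: Int -> IO Int
effectsAfterTake k = do
  counter <- newIORef 0
  xs <- deferredBuild counter [1 :: Int ..]
  _ <- evaluate (length (take k xs))
  readIORef counter
```

effectsAfterTake 3 should report 3: each effect runs only when the corresponding list cell is forced, which is why the fold now terminates on an infinite input. It also means the order of effects depends on evaluation order, which is the sense in which the semantics of IO are broken.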

Obviously, this allows the IO actions to be interleaved, but since all you're doing is generating random numbers it shouldn't matter. However, this solution isn't very satisfactory; the correct solution is to refactor your code somewhat. You should generate an infinite list of random numbers, then consume that infinite list (lazily) with foldr:

import Control.Monad.Random (MonadRandom, getRandomRs)

sampleReservoir3 :: MonadRandom m => [a] -> m [a]
sampleReservoir3 stream = do
  ws <- getRandomRs (0, 1 :: Double) 
  return $ foldr 
     (\(w, (i, e)) reservoir -> 
        (if w < (1 / fromIntegral i) then (e:) else id) reservoir
     ) 
     []
     (zip ws $ enumerate 1 stream)

This can also (equivalently) be written as

import System.Random (newStdGen, randomRs)

sampleReservoir4 :: [a] -> IO [a] 
sampleReservoir4 stream = do
  seed <- newStdGen 
  let ws = randomRs (0, 1 :: Double) seed 
  return $ foldr 
     (\(w, (i, e)) reservoir -> 
        (if w < (1 / fromIntegral i) then (e:) else id) reservoir
     ) 
     []
     (zip ws $ enumerate 1 stream)
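The refactoring boils down to separating the source of randomness from the selection. As a sketch of that idea with the weights made an explicit argument (sampleWith is a hypothetical name, and passing the weights in by hand is an assumption made so the behaviour is deterministic and needs only base):

```haskell
-- Hypothetical helper: keep the i-th element (1-based) whenever its
-- weight falls below 1/i. The weights are an ordinary lazy list, so the
-- function is pure and produces its output incrementally.
sampleWith :: [Double] -> [a] -> [a]
sampleWith ws stream = foldr step [] (zip3 [1 :: Int ..] ws stream)
  where
    step (i, w, e) rest
      | w < 1 / fromIntegral i = e : rest
      | otherwise              = rest
```

For example, sampleWith (repeat 0.3) [1 .. 10] keeps [1,2,3], since 0.3 is below 1/1, 1/2 and 1/3 but not below 1/4. With an infinite input, take 2 (sampleWith (repeat 0) [10 ..]) returns [10,11] without touching the rest of the stream. Asking for more elements than are ever selected from an infinite stream still does not terminate.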

As an aside, I'm not sure about the correctness of the algorithm, since it seems to always return the first element of the input list first. Not very random.
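The aside can be made precise. Assuming the weights are independent and uniform on the unit interval, the i-th element x_i is kept with probability 1/i, so x_1 is (essentially) always kept and the output grows only logarithmically with the input. Only the last kept element of a finite input of length n is a uniform sample, which is what single-item reservoir sampling returns:

```latex
\Pr[x_i \text{ kept}] = \frac{1}{i}, \qquad
\Pr[x_1 \text{ kept}] = 1, \qquad
\mathbb{E}\bigl[\#\{\text{kept among } x_1,\dots,x_n\}\bigr]
  = \sum_{i=1}^{n} \frac{1}{i} = H_n \approx \ln n + \gamma

\Pr[x_i \text{ is the last kept element}]
  = \frac{1}{i} \prod_{j=i+1}^{n} \Bigl(1 - \frac{1}{j}\Bigr)
  = \frac{1}{i} \cdot \frac{i}{n} = \frac{1}{n}
```

Here H_n is the n-th harmonic number and \gamma is the Euler-Mascheroni constant.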
