Puzzling behavior in scalaz-stream with chunk and zipWithIndex

I am trying to process a stream of data using scalaz-stream with an expensive operation※.

scala> :paste
// Entering paste mode (ctrl-D to finish)

    def expensive[T](x:T): T = {
      println(s"EXPENSIVE! $x")
      x
    }
    ^D
// Exiting paste mode, now interpreting.

expensive: [T](x: T)T

※ Yes, yes, I know mixing in code with side-effects is bad functional programming style. The print statements are just to track the number of times expensive() gets called.

Before passing the data to the expensive operation, I first need to split it into chunks.

scala> val chunked: Process[Task,Vector[Int]] = Process.range(0,4).chunk(2)
chunked: scalaz.stream.Process[scalaz.concurrent.Task,Vector[Int]] = Await(scalaz.concurrent.Task@7ef516f3,<function1>,Emit(SeqView(...),Halt(scalaz.stream.Process$End$)),Emit(SeqView(...),Halt(scalaz.stream.Process$End$)))

scala> chunked.runLog.run
res1: scala.collection.immutable.IndexedSeq[Vector[Int]] = Vector(Vector(0, 1), Vector(2, 3), Vector())

Then I map the expensive operation onto the stream of chunks.

scala> val processed = chunked.map(expensive)
processed: scalaz.stream.Process[scalaz.concurrent.Task,Vector[Int]] = Await(scalaz.concurrent.Task@7ef516f3,<function1>,Emit(SeqViewM(...),Halt(scalaz.stream.Process$End$)),Emit(SeqViewM(...),Halt(scalaz.stream.Process$End$)))

When I execute this, it calls expensive() the expected number of times:

scala> processed.runLog.run
EXPENSIVE! Vector(0, 1)
EXPENSIVE! Vector(2, 3)
EXPENSIVE! Vector()
res2: scala.collection.immutable.IndexedSeq[Vector[Int]] = Vector(Vector(0, 1), Vector(2, 3), Vector())

However, if I chain a call to zipWithIndex, expensive() gets called many more times:

scala> processed.zipWithIndex.runLog.run
EXPENSIVE! Vector()
EXPENSIVE! Vector()
EXPENSIVE! Vector()
EXPENSIVE! Vector()
EXPENSIVE! Vector(0)
EXPENSIVE! Vector(0)
EXPENSIVE! Vector(0)
EXPENSIVE! Vector(0)
EXPENSIVE! Vector(0, 1)
EXPENSIVE! Vector()
EXPENSIVE! Vector()
EXPENSIVE! Vector()
EXPENSIVE! Vector()
EXPENSIVE! Vector(2)
EXPENSIVE! Vector(2)
EXPENSIVE! Vector(2)
EXPENSIVE! Vector(2)
EXPENSIVE! Vector(2, 3)
EXPENSIVE! Vector()
EXPENSIVE! Vector()
EXPENSIVE! Vector()
EXPENSIVE! Vector()
EXPENSIVE! Vector()
EXPENSIVE! Vector()
res3: scala.collection.immutable.IndexedSeq[(Vector[Int], Int)] = Vector((Vector(0, 1),0), (Vector(2, 3),1), (Vector(),2))

Is this a bug? If it is the desired behavior, can anybody explain why? If expensive() takes a long time, you can see why I would prefer the result with fewer calls.

Here is a gist with more examples: https://gist.github.com/underspecified/11279251

You're seeing this issue, which can take a number of different forms. The problem is essentially that map can see (and do stuff with) the intermediate steps that chunk is taking while it builds up its results.
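To make the idea concrete, here is a small toy model in plain Scala. It is only a sketch: the Proc type, its Emit/Await/Halt cases, and the chunk, map, and run functions are invented for illustration and are not scalaz-stream's actual Process implementation. The assumption it encodes is that chunk keeps a fallback branch holding its partial accumulator (to flush at end of input), and that map rewrites every branch it can reach, including fallbacks that are never taken.

```scala
import scala.annotation.tailrec
import scala.collection.mutable.ListBuffer

object ToyChunkMap {
  // Toy stream type invented for this sketch; NOT scalaz-stream's Process.
  sealed trait Proc[+O]
  case object Halt extends Proc[Nothing]
  final case class Emit[+O](value: O, next: () => Proc[O]) extends Proc[O]
  // Wait for one Int; `onEnd` is the fallback used if the input is exhausted.
  final case class Await[+O](recv: Int => Proc[O], onEnd: Proc[O]) extends Proc[O]

  // Group inputs into vectors of size n; the fallback flushes the partial chunk.
  def chunk(n: Int): Proc[Vector[Int]] = {
    def go(m: Int, acc: Vector[Int]): Proc[Vector[Int]] =
      if (m <= 0) Emit(acc, () => go(n, Vector.empty))
      else Await(i => go(m - 1, acc :+ i), onEnd = Emit(acc, () => Halt))
    go(n, Vector.empty)
  }

  // Structural map: applies f to every Emit it reaches, including the
  // fallback branch of an Await, whether or not that branch is ever taken.
  def map[A, B](p: Proc[A])(f: A => B): Proc[B] = p match {
    case Halt               => Halt
    case Emit(v, next)      => Emit(f(v), () => map(next())(f))
    case Await(recv, onEnd) => Await(i => map(recv(i))(f), map(onEnd)(f))
  }

  // Feed the inputs through the process and collect what it emits.
  def run[O](p: Proc[O], input: List[Int]): Vector[O] = {
    @tailrec
    def loop(cur: Proc[O], in: List[Int], out: Vector[O]): Vector[O] = cur match {
      case Halt          => out
      case Emit(v, next) => loop(next(), in, out :+ v)
      case Await(recv, onEnd) => in match {
        case h :: t => loop(recv(h), t, out)
        case Nil    => loop(onEnd, Nil, out)
      }
    }
    loop(p, input, Vector.empty)
  }

  // Record each argument instead of printing, so the calls can be inspected.
  val seen = ListBuffer.empty[Vector[Int]]
  def expensive(v: Vector[Int]): Vector[Int] = { seen += v; v }

  val out: Vector[Vector[Int]] = run(map(chunk(2))(expensive), List(0, 1, 2, 3))

  def main(args: Array[String]): Unit = {
    seen.foreach(v => println(s"EXPENSIVE! $v"))
    println(out)
  }
}
```

In this toy, three chunks are emitted but expensive is applied seven times, to Vector(), Vector(0), Vector(0, 1), Vector(), Vector(2), Vector(2, 3) and Vector(), which is the same sequence of partial accumulators that shows up in the question's output. It does not reproduce the repeated calls for each partial value; those depend on details of the real library that this sketch leaves out.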

This behavior may change in the future, but in the meantime there are a couple of possible workarounds. One of the simplest is to wrap your expensive function in a process and use flatMap instead of map:

chunked.flatMap(a =>
  Process.eval(Task.delay(expensive(a)))
).zipWithIndex.runLog.run

Another solution is to wrap your expensive function in an effectful channel:

def expensiveChannel[A] = Process.constant((a: A) => Task.delay(expensive(a)))

Now you can use through:

chunked.through(expensiveChannel).zipWithIndex.runLog.run

While the current behavior can be a little surprising, it's also a good reminder that you should be using the type system to help you track all the effects you care about (and long-running computation can be one of these).
