Performance of line counting with scalaz-stream

Question

I've translated the imperative line-counting code (see linesGt1) from the beginning of chapter 15 of Functional Programming in Scala to a solution that uses scalaz-stream (see linesGt2). However, the performance of linesGt2 is not great: the imperative code is about 30 times faster than my scalaz-stream solution. So I guess I'm doing something fundamentally wrong. How can the performance of the scalaz-stream code be improved?

Here is my complete test code:

import scalaz.concurrent.Task
import scalaz.stream._

object Test06 {

val minLines = 400000

def linesGt1(filename: String): Boolean = {
  val src = scala.io.Source.fromFile(filename)
  try {
    var count = 0
    val lines: Iterator[String] = src.getLines
    while (count <= minLines && lines.hasNext) {
      lines.next
      count += 1
    }
    count > minLines
  }
  finally src.close
}

def linesGt2(filename: String): Boolean =
  scalaz.stream.io.linesR(filename)
    .drop(minLines)
    .once
    .as(true)
    .runLastOr(false)
    .run

def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block
  val t1 = System.nanoTime()
  println("Elapsed time: " + (t1 - t0) / 1e9 + "s")
  result
}

time(linesGt1("/home/frank/test.txt"))        //> Elapsed time: 0.153122057s
                                              //| res0: Boolean = true
time(linesGt2("/home/frank/test.txt"))        //> Elapsed time: 4.738644606s
                                              //| res1: Boolean = true
}

Answer 1

When you are doing profiling or timing, you can use Process.range to generate your inputs, which isolates your actual computation from the I/O. Adapting your example:

time { Process.range(0,100000).drop(40000).once.as(true).runLastOr(false).run }

When I first ran this, it took about 2.2 seconds on my machine, which seems consistent with what you were seeing. After a couple of runs, probably once the JIT had kicked in, I was consistently getting around 0.64 seconds, and in principle I don't see any reason why it couldn't be just as fast even with I/O (see discussion below).
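To separate warm-up from steady-state timings like the ones above, a timing helper can run the block a few times untimed before measuring. The following is only a sketch using the standard library; the name timeWarm and its parameters are made up for illustration and are not part of scalaz-stream.

```scala
object WarmTiming {
  // Hypothetical helper: evaluates `block` `warmups` times without timing it
  // (giving the JIT a chance to compile the hot paths), then returns the last
  // result together with the average wall-clock seconds over `runs` timed runs.
  def timeWarm[R](warmups: Int, runs: Int)(block: => R): (R, Double) = {
    require(warmups >= 0 && runs > 0, "warmups must be >= 0 and runs > 0")
    (0 until warmups).foreach(_ => block) // untimed warm-up evaluations
    val t0 = System.nanoTime()
    var result = block
    var i = 1
    while (i < runs) { result = block; i += 1 }
    val t1 = System.nanoTime()
    (result, (t1 - t0) / 1e9 / runs)
  }
}
```

For example, WarmTiming.timeWarm(5, 10) { linesGt2("/home/frank/test.txt") } would report the average over ten runs after five warm-up runs. A proper benchmarking harness would be more reliable than any hand-rolled timer.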

In my informal testing, the overhead per 'step' of scalaz-stream seems to be about 1-2 microseconds (for instance, try Process.range(0,10000)). If you have a pipeline with multiple stages, then each step of the overall stream will consist of several other steps. The way to think about minimizing the overhead of scalaz-stream is just to make sure that you're doing enough work at each step to dwarf any overhead added by scalaz-stream itself. This post has more details on this approach. The line-counting example is kind of a worst case, since you are doing almost no work per step and are just counting the steps.
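As a back-of-the-envelope check, the per-step figure can be turned into an estimated total. This is a sketch of a simple multiplicative cost model, not a measurement; the function name and the assumption that every stage pays the same per-step cost are illustrative only.

```scala
object OverheadEstimate {
  // Assumed model: total overhead = elements * stages * per-step overhead.
  // All three inputs are guesses to vary, not measured values.
  def estimateSeconds(elements: Long, stages: Int, microsPerStep: Double): Double =
    elements * stages * microsPerStep / 1e6
}
```

Under this model, 400,000 lines through a single stage at 1-2 microseconds per step would cost roughly 0.4-0.8 seconds, and each additional stage adds a similar amount.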

So I would try writing a version of linesR that reads multiple lines per step, and also make sure you do your measurements after the JIT has warmed up.
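The batching idea can be sketched independently of the library: a step function reads up to n lines at once, so the per-step overhead is paid once per chunk rather than once per line. The names readChunk and linesGtChunked are hypothetical. In a scalaz-stream version, something like readChunk would serve as the step of a custom source that emits Vector[String]; that wiring is not shown here because the relevant API details differ between versions.

```scala
object ChunkedLines {
  // Reads up to `n` lines from `lines`; returns an empty Vector at end of input.
  def readChunk(lines: Iterator[String], n: Int): Vector[String] = {
    val buf = Vector.newBuilder[String]
    var i = 0
    while (i < n && lines.hasNext) { buf += lines.next(); i += 1 }
    buf.result()
  }

  // Same question as linesGt1 ("are there more than minLines lines?"),
  // but taking one step per chunk instead of one step per line.
  def linesGtChunked(lines: Iterator[String], minLines: Int, chunkSize: Int): Boolean = {
    require(chunkSize > 0, "chunkSize must be positive")
    var count = 0
    var done = false
    while (!done && count <= minLines) {
      val chunk = readChunk(lines, chunkSize)
      if (chunk.isEmpty) done = true else count += chunk.size
    }
    count > minLines
  }
}
```

With a chunk size of, say, 1024, a 400,000-line file needs about 400 steps instead of 400,000, so the fixed cost per step matters far less.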

Answer 1: 2, accepted, 2013-09-18 14:26:25
