
How can I speed up scalaz-stream text processing?

How can I speed up the following scalaz-stream code? Currently it takes about 5 minutes to process 70 MB of text, so I am probably doing something quite wrong, since a plain Scala equivalent would take a few seconds.

(This is a follow-up to another question.)

  val converter2: Task[Unit] = {
    val docSep = "~~~"
    io.linesR("myInput.txt")
      .flatMap { line =>
        val words = line.split(" ")
        if (words.length == 0 || words(0) != docSep) Process(line)
        else Process(docSep, words.tail.mkString(" "))
      }
      .split(_ == docSep)
      .filter(_ != Vector())
      .map(lines => lines.head + ": " + lines.tail.mkString(" "))
      .intersperse("\n")
      .pipe(text.utf8Encode)
      .to(io.fileChunkW("correctButSlowOutput.txt"))
      .run
  }
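For comparison, the question says a plain Scala equivalent runs in seconds. One hedged sketch of such an equivalent is below; the input file name is taken from the question, while the output name, the `PlainConverter` object, and the `convert` helper are hypothetical:

```scala
import scala.io.Source
import java.io.PrintWriter

object PlainConverter {
  val docSep = "~~~"

  // Pure equivalent of the stream pipeline: split the input into documents
  // at lines starting with docSep, then render each as "header: body".
  def convert(lines: Seq[String]): Seq[String] = {
    // Mirror the flatMap step: turn "~~~ Title" into two tokens.
    val tokens = lines.flatMap { line =>
      val words = line.split(" ")
      if (words.length == 0 || words(0) != docSep) Seq(line)
      else Seq(docSep, words.tail.mkString(" "))
    }
    // Mirror split(_ == docSep): group tokens into documents.
    val docs = tokens.foldLeft(Vector(Vector.empty[String])) {
      case (acc, `docSep`) => acc :+ Vector.empty
      case (acc, line)     => acc.init :+ (acc.last :+ line)
    }.filter(_.nonEmpty)
    // Mirror the map step: "header: rest of document".
    docs.map(doc => doc.head + ": " + doc.tail.mkString(" "))
  }

  def main(args: Array[String]): Unit = {
    val src = Source.fromFile("myInput.txt")
    val out = new PrintWriter("plainOutput.txt") // hypothetical output name
    try convert(src.getLines().toSeq).foreach(out.println)
    finally { src.close(); out.close() }
  }
}
```

This does the whole transformation eagerly in memory, which is fine at 70 MB and gives a useful baseline for how much of the 5 minutes is streaming overhead rather than actual work.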

I think you could just use one of the process1 chunk methods to do the chunking. If you want a lot of parallel processing when merging the lines into your output format, decide whether ordered output matters and use a channel combined with merge or tee. That will also make the code reusable. Because you are doing a very small amount of processing per element, you are probably swamped by per-element overhead, so you have to work harder to make your unit of work large enough not to be dominated by it.
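As a sketch of the chunking idea (assuming scalaz-stream 0.7.x, where `process1.chunk(n)` batches `n` elements into a `Vector`; the chunk size and output file name are arbitrary guesses, not from the question), the read side could look like:

```scala
// Sketch only: batch lines so each pipeline step does a meaningful
// amount of work instead of paying per-line overhead.
io.linesR("myInput.txt")
  .pipe(process1.chunk(10000))        // Vector[String] batches of up to 10000 lines
  .map(batch => batch.mkString("\n")) // one larger unit of work per batch
  .intersperse("\n")
  .pipe(text.utf8Encode)
  .to(io.fileChunkW("chunkedOutput.txt")) // hypothetical output name
  .run
```

The document-splitting logic would then have to operate on batches rather than single lines, which is more code but amortizes the interpreter overhead across many elements.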

The following is based on @user1763729's suggestion of chunking. It feels clunky, though, and it is just as slow as the original version.

  val converter: Task[Unit] = {
    val docSep = "~~~"
    io.linesR("myInput.txt")
      .intersperse("\n") // handle empty documents (chunkBy has to switch from true to false)
      .zipWithPrevious // chunkBy cuts only *after* the predicate turns false
      .chunkBy {
        case (Some(prev), line) =>
          val words = line.split(" ")
          words.length == 0 || words(0) != docSep
        case (None, line) => true
      }
      .map(_.map(_._1.getOrElse(""))) // get previous element
      .map(_.filter(!Set("", "\n").contains(_)))
      .map(lines => lines.head.split(" ").tail.mkString(" ") + ": " + lines.tail.mkString(" "))
      .intersperse("\n")
      .pipe(text.utf8Encode)
      .to(io.fileChunkW("stillSlowOutput.txt"))
      .run
  }

EDIT:

Actually, doing the following (just reading the file, with no writing or processing) already takes 1.5 minutes, so I guess there is not much hope of speeding this up.

  val converter: Task[Unit] = {
    io.linesR("myInput.txt")
      .pipe(text.utf8Encode)
      .run
  }

Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site or link to the original. For any questions, contact yoyou2525@163.com.
