
Scala, read file, process lines and write output to a new file using concurrent (Akka), asynchronous APIs (NIO.2)

1: I'm running into a problem trying to process a large text file (10 GB+).

The single-threaded solution is the following:

import java.io.{File, PrintWriter}

val writer = new PrintWriter(new File(output.getOrElse("output.txt")))
for(line <- scala.io.Source.fromFile(file.getOrElse("data.txt")).getLines())
{
  writer.println(DigestUtils.HMAC_SHA_256(line))
}
writer.close()

2: I tried concurrent processing using

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
val futures = scala.io.Source.fromFile(file.getOrElse("data.txt")).getLines
               .map{ s => Future{ DigestUtils.HMAC_SHA_256(s) } }.to[Vector]
val results = futures.map{ Await.result(_, 10000 seconds) }

This yields a "GC overhead limit exceeded" exception (see Appendix A for the stack trace).

3: I tried using Akka IO in combination with AsynchronousFileChannel, following https://github.com/drexin/akka-io-file. I am able to read the file in byte chunks using FileSlurp, but have not been able to find a solution for reading the file line by line, which is a requirement.

Any help would be greatly appreciated. Thank you.

APPENDIX A

[error] (run-main) java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.nio.CharBuffer.wrap(Unknown Source)
        at sun.nio.cs.StreamDecoder.implRead(Unknown Source)
        at sun.nio.cs.StreamDecoder.read(Unknown Source)
        at java.io.InputStreamReader.read(Unknown Source)
        at java.io.BufferedReader.fill(Unknown Source)
        at java.io.BufferedReader.readLine(Unknown Source)
        at java.io.BufferedReader.readLine(Unknown Source)
        at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
        at scala.collection.immutable.VectorBuilder.$plus$plus$eq(Vector.scala:716)
        at scala.collection.immutable.VectorBuilder.$plus$plus$eq(Vector.scala:692)
        at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
        at scala.collection.AbstractIterator.to(Iterator.scala:1157)
        at com.test.Twitterhashconcurrentcli$.doConcurrent(Twitterhashconcurrentcli.scala:35)
        at com.test.Twitterhashconcurrentcli$delayedInit$body.apply(Twitterhashconcurrentcli.scala:62)
        at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
        at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
        at scala.App$$anonfun$main$1.apply(App.scala:71)
        at scala.App$$anonfun$main$1.apply(App.scala:71)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
        at scala.App$class.main(App.scala:71)

The trick here is to avoid potentially reading all the data into memory at once. If you iterate and send lines to workers, you run this risk, because sending to an actor is async, so you might read all the data into memory and it will sit in the mailboxes of the actors, probably leading to an OOM exception. A better high-level approach would be to use a single master actor with a pool of child workers underneath it for the processing. The trick is to use a lazy stream over the file (like the Iterator returned from scala.io.Source.fromX) in the master and then use a work-pulling pattern in the workers to prevent their mailboxes from filling up with data. Then, when the iterator no longer has any lines, the master stops itself, and that will stop the workers (and if necessary, you can also use this point to shut down the actor system, if that's what you really want to do).

Here is a very rough outline. Please note that I have not tested this yet:

import akka.actor._
import akka.routing.RoundRobinLike
import akka.routing.RoundRobinRouter
import scala.io.Source
import akka.routing.Broadcast

object FileReadMaster{
  case class ProcessFile(filePath:String)
  case class ProcessLines(lines:List[String], last:Boolean = false)
  case class LinesProcessed(lines:List[String], last:Boolean = false)

  case object WorkAvailable
  case object GimmeeWork
}

class FileReadMaster extends Actor{
  import FileReadMaster._

  val workChunkSize = 10
  val workersCount = 10

  def receive = waitingToProcess

  def waitingToProcess:Receive = {
    case ProcessFile(path) =>
      val workers = (for(i <- 1 to workersCount) yield context.actorOf(Props[FileReadWorker])).toList
      val workersPool = context.actorOf(Props.empty.withRouter(RoundRobinRouter(routees = workers)))
      val it = Source.fromFile(path).getLines
      workersPool ! Broadcast(WorkAvailable)
      context.become(processing(it, workersPool, workers.size))

      //Setup deathwatch on all
      workers foreach (context watch _)
  }

  def processing(it:Iterator[String], workers:ActorRef, workersRunning:Int):Receive = {
    case ProcessFile(path) => 
      sender ! Status.Failure(new Exception("already processing!!!"))


    case GimmeeWork if it.hasNext =>
      val lines = List.fill(workChunkSize){
        if (it.hasNext) Some(it.next)
        else None
      }.flatten

      //Mark the chunk as the last one when the iterator has been exhausted
      sender ! ProcessLines(lines, last = !it.hasNext)

      //If no more lines, broadcast poison pill
      if (!it.hasNext) workers ! Broadcast(PoisonPill)

    case GimmeeWork =>
      //get here if no more work left

    case LinesProcessed(lines, last) =>
      //Do something with the lines

    //Termination for last worker
    case Terminated(ref)  if workersRunning == 1 =>
      //Done with all work, do what you gotta do when done here

    //Terminated for non-last worker
    case Terminated(ref) =>
      context.become(processing(it, workers, workersRunning - 1))

  }
}

class FileReadWorker extends Actor{
  import FileReadMaster._

  def receive = {
    case ProcessLines(lines, last) =>
      //Dummy processing (string reverse); replace with the real per-line work
      sender ! LinesProcessed(lines.map(_.reverse), last)
      sender ! GimmeeWork

    case WorkAvailable =>
      sender ! GimmeeWork
  }
}

The idea is that the master iterates over the file's contents and sends chunks of work to a pool of child workers. When file processing starts, the master tells all the children that work is available. Each child then keeps requesting work until there is no more work left. When the master detects that the file has been read completely, it broadcasts a poison pill to the children, which lets them finish any outstanding work and then stop. When all of the children have stopped, the master can finish whatever cleanup is needed.
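
To kick things off, here is a minimal (untested) sketch of how the master could be started; the ActorSystem name and the hard-coded input path are assumptions for illustration:

import akka.actor._

object FileReadApp extends App {
  val system = ActorSystem("file-read")                         //assumed system name
  val master = system.actorOf(Props[FileReadMaster], "master")
  master ! FileReadMaster.ProcessFile("data.txt")               //path is a placeholder
}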

Again, this is very rough, based on what I think you are asking about. If I'm off in any area, let me know and I can amend the answer.

In fact, in the parallel variant you are trying to first read the whole file into memory as a list of lines, and then take a copy (with the method List.to). Evidently, this causes the OOME.

To parallelize, first decide whether it is worth doing. You should not parallelize reading from a sequential file (nor writing): this only causes excessive movement of the magnetic heads and makes things slower. Parallelization only makes sense if DigestUtils.HMAC_SHA_256(s) takes comparable or greater time than reading a line. Run benchmarks to measure both times. Then, if you decide that parallelizing the hash computation is worth doing, work out the number of worker threads: the idea is that the elapsed computation time should be roughly equal to the reading time. Let one thread read lines, pack them into batches (say 1000 lines per batch), and put the batches into an ArrayBlockingQueue of fixed size (say 1000). Batching is required because otherwise there are too many lines and so too many synchronized operations on the queue, causing contention. Let the worker threads read batches from that queue using the method take.

One more thread should write the results to "output.txt", also connected via a blocking queue. If you have to keep the order of lines in the output file, then a more complex communication facility should be used instead of the second queue, but that is another story.

The code below is not tested :)
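
Here is a rough Scala sketch of the pipeline just described; the batch size, queue capacities, the hashLine helper, and the file names are placeholders (output order is not preserved):

import java.io.{File, PrintWriter}
import java.util.concurrent.ArrayBlockingQueue
import scala.io.Source

object QueuePipeline extends App {
  def thread(body: => Unit): Thread = {
    val t = new Thread(new Runnable { def run(): Unit = body }); t.start(); t
  }

  val batchSize   = 1000                                          //assumed batch size
  val workerCount = Runtime.getRuntime.availableProcessors
  val inQueue     = new ArrayBlockingQueue[List[String]](1000)    //reader -> workers
  val outQueue    = new ArrayBlockingQueue[List[String]](1000)    //workers -> writer
  val poison: List[String] = Nil                                  //empty batch signals shutdown

  def hashLine(s: String): String = s                             //placeholder for DigestUtils.HMAC_SHA_256

  //Workers: take a batch, hash every line, pass it on; forward the poison when done
  val workers = (1 to workerCount).map { _ =>
    thread {
      var batch = inQueue.take()
      while (batch.nonEmpty) { outQueue.put(batch.map(hashLine)); batch = inQueue.take() }
      outQueue.put(poison)
    }
  }

  //Writer: drain processed batches until a poison has arrived from every worker
  val writer = thread {
    val out = new PrintWriter(new File("output.txt"))
    var poisonsSeen = 0
    while (poisonsSeen < workerCount) {
      val batch = outQueue.take()
      if (batch.isEmpty) poisonsSeen += 1 else batch.foreach(out.println)
    }
    out.close()
  }

  //Reader (main thread): group lines into batches and feed the input queue
  Source.fromFile("data.txt").getLines().grouped(batchSize).foreach(b => inQueue.put(b.toList))
  (1 to workerCount).foreach(_ => inQueue.put(poison))            //one poison per worker

  workers.foreach(_.join()); writer.join()
}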

Mapping to Futures is definitely not a good idea.
Instead, since you are already using Akka, I'd introduce a special LineProcessor actor and then send lines to it:

val processor = system.actorOf(Props(new LineProcessor))

val src = scala.io.Source.fromFile(file.getOrElse("data.txt"))

src.getLines.foreach(line => processor ! line)  

And inside the LineProcessor you can encapsulate the logic to process the line:

class LineProcessor extends Actor {
  def receive = {
    case line: String => // process the line
  }
}

The trick here is that with actors you can quite easily scale horizontally. Just wrap the LineProcessor actor inside a Router...

// this will create 10 workers to process your lines simultaneously
val processor = system.actorOf(Props(new LineProcessor).withRouter(RoundRobinRouter(10)))

One thing worth mentioning is that if you need to write the lines somewhere with their order preserved, it becomes a little bit trickier. =) (When reading a line from the file you also need to capture its number, and when writing the results back you need to coordinate across all the workers.)
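
For illustration, an untested sketch of one way to do that: tag each line with its index and let a single writer actor buffer out-of-order results until the next expected index arrives. The NumberedLine/HashedLine messages and the process helper are invented for this sketch:

import akka.actor._
import akka.routing.RoundRobinRouter

//Hypothetical messages for this sketch: each line carries its position in the file
case class NumberedLine(index: Long, line: String)
case class HashedLine(index: Long, hash: String)

class LineWorker(writer: ActorRef) extends Actor {
  def process(s: String): String = s                 //placeholder for DigestUtils.HMAC_SHA_256
  def receive = {
    case NumberedLine(i, line) => writer ! HashedLine(i, process(line))
  }
}

class OrderedWriter(out: java.io.PrintWriter) extends Actor {
  var nextIndex = 0L
  var pending = Map.empty[Long, String]              //results that arrived ahead of their turn

  def receive = {
    case HashedLine(i, hash) =>
      pending += (i -> hash)
      //Flush everything that is now contiguous with what has already been written
      while (pending.contains(nextIndex)) {
        out.println(pending(nextIndex))
        pending -= nextIndex
        nextIndex += 1
      }
  }
}

// Wiring it up (sketch):
//   val writer  = system.actorOf(Props(new OrderedWriter(out)))
//   val workers = system.actorOf(Props(new LineWorker(writer)).withRouter(RoundRobinRouter(10)))
//   src.getLines().zipWithIndex.foreach { case (line, i) => workers ! NumberedLine(i, line) }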
