
Parallel File Processing: What are recommended ways?

This is largely a combination of a design problem and a code problem.

Use Case
- Given many log files in the range 2MB - 2GB, I need to parse each of these logs, apply some processing, and generate Java POJOs.
- For this problem, let's assume that we have just 1 log file.
- Also, the idea is to make the best use of the system. Multiple cores are available.

Alternative 1
- Open the file (synchronously), read each line, generate POJOs

FileActor -> read each line -> List<POJO>  

Pros: simple to understand
Cons: serial process, not taking advantage of the multiple cores in the system
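For reference, Alternative 1 can be sketched in a few lines of plain Java. Here `LogEntry` is a hypothetical stand-in for the generated POJO, and a `StringReader` stands in for the real log file:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class SerialParse {
    static class LogEntry {              // hypothetical POJO
        final String raw;
        LogEntry(String raw) { this.raw = raw; }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for: new BufferedReader(new FileReader(pathToLogFile))
        try (BufferedReader reader = new BufferedReader(new StringReader("a\nb\nc"))) {
            List<LogEntry> pojos = new ArrayList<>();
            for (String line; (line = reader.readLine()) != null; ) {
                pojos.add(new LogEntry(line));   // parse/process each line serially
            }
            System.out.println(pojos.size());
        }
    }
}
```

This is the baseline the other alternatives try to beat: one thread does all the reading and all the parsing.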

Alternative 2
- Open the file (synchronously), read N lines (N is configurable), pass them on to different actors to process

                                                    / LogLineProcessActor 1
FileActor -> LogLineProcessRouter (with 10 Actors) -- LogLineProcessActor 2
                                                    \ LogLineProcessActor 10

Pros: some parallelization, by using different actors to process parts of the lines. The actors will make use of the available cores in the system (? how, maybe?)
Cons: still serial, because the file is read in a serial fashion
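The chunking idea behind Alternative 2 can be sketched without Akka, using a plain `ExecutorService` in place of the actor router: read N lines at a time, hand each chunk to a worker thread, and collect the results. `LogEntry` is a hypothetical POJO, and the in-memory line list stands in for the file read:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

public class ChunkedProcessor {
    static class LogEntry {                  // hypothetical POJO
        final String raw;
        LogEntry(String raw) { this.raw = raw; }
    }

    public static void main(String[] args) throws Exception {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 100; i++) lines.add("log line " + i);

        int chunkSize = 25;                  // the configurable "N lines"
        ExecutorService workers = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        // Reader stays serial; each chunk is processed in parallel by the pool
        List<Future<List<LogEntry>>> futures = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += chunkSize) {
            List<String> chunk =
                    lines.subList(start, Math.min(start + chunkSize, lines.size()));
            futures.add(workers.submit(() ->
                    chunk.stream().map(LogEntry::new).collect(Collectors.toList())));
        }

        List<LogEntry> all = new ArrayList<>();
        for (Future<List<LogEntry>> f : futures) all.addAll(f.get());
        workers.shutdown();
        System.out.println(all.size());
    }
}
```

As in the actor version, the read remains serial; only the per-chunk processing is parallel.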

Questions
- Is either of the above choices a good one?
- Are there better alternatives?

Please provide valuable thoughts here.

Thanks a lot

Why not take advantage of what's already available and use the parallel stream support that comes with JDK 1.8? I would start with something like this and see how it performs:

try (Stream<String> lines = Files.lines(Paths.get( /* path to a log file */ ))) {
    lines.parallel()              // process the stream in parallel
         .map(YourBean::new)      // or some mapping method to your bean class
         .forEach(bean -> { /* process the beans here */ });
}

You may need some tweaks to the thread pooling, because parallel() is by default executed on ForkJoinPool.commonPool(), and you can't really customize it to achieve maximum performance, but people seem to have found workarounds for that too; there is some material on the topic here.
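One commonly cited workaround (a sketch, not part of the original answer) is to submit the parallel-stream work to your own ForkJoinPool: tasks spawned by a parallel stream run on the pool of the thread that starts them, so the stream below uses the custom pool's parallelism instead of the common pool's:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ForkJoinPool;

public class CustomPoolDemo {
    public static void main(String[] args) throws Exception {
        List<String> lines = Arrays.asList("a", "b", "c", "d"); // stand-in for log lines
        ForkJoinPool pool = new ForkJoinPool(4);                // parallelism you choose

        // Because the terminal operation starts inside pool.submit(...),
        // the parallel-stream tasks execute on this pool, not commonPool().
        long count = pool.submit(() ->
                lines.parallelStream()
                     .map(String::toUpperCase)                  // stand-in for POJO mapping
                     .count()
        ).get();

        pool.shutdown();
        System.out.println(count);
    }
}
```

Note this relies on an implementation detail of the fork/join framework rather than a documented stream API guarantee, so treat it as a pragmatic workaround.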

Alternative 2 looks good. I would just change one thing: read the biggest chunk of the file you can. IO will be a problem if you do it in small bursts. As there are several files, I would create an actor that gets the names of the files by reading a particular folder. It would then send the path of each file to the LogLineReader, which would read a big chunk of the file. Finally, it would send each line to the LogLineProcessActor. Be aware that they may process the lines out of order. If that is not a problem, they will keep your CPU busy.
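Stripped of Akka, the reader/worker split described above can be sketched with plain JDK threads and a BlockingQueue (the class names and the poison-pill shutdown are illustrative, not from the answer). Note, as the answer warns, that the workers process lines in whatever order they pull them off the queue:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PipelineSketch {
    private static final String POISON = "__EOF__"; // shutdown marker

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        int workerCount = 4;
        AtomicInteger processed = new AtomicInteger();

        // "LogLineProcessActor" stand-ins: consume lines until a poison pill arrives
        ExecutorService pool = Executors.newFixedThreadPool(workerCount);
        for (int i = 0; i < workerCount; i++) {
            pool.submit(() -> {
                try {
                    for (String line; !(line = queue.take()).equals(POISON); ) {
                        processed.incrementAndGet();   // parse into a POJO here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // "LogLineReader" stand-in: feed lines (a real reader would read big chunks)
        for (int i = 0; i < 1000; i++) queue.put("log line " + i);
        for (int i = 0; i < workerCount; i++) queue.put(POISON); // one pill per worker

        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(processed.get());
    }
}
```

The single reader keeps IO sequential (and can read in large chunks), while the queue fans lines out to as many workers as there are cores.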

If you feel adventurous, you could also try the new akka stream 1.0.

Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license; if you republish, please credit this site or the original source.
