Java 8 MapReduce for distributed computing

I was glad to hear about parallelStream() in Java 8, which spreads processing over multiple cores and returns the result, all within a single JVM. No more hand-written multithreading code. As far as I understand, though, this works within a single JVM only.

But what if I want to distribute the processing across different JVMs on a single host, or even across multiple hosts? Does Java 8 include any abstraction that simplifies this?

In a tutorial at dreamsyssoft.com, a list of users

private static List<User> users = Arrays.asList(
    new User(1, "Steve", "Vai", 40),
    new User(4, "Joe", "Smith", 32),
    new User(3, "Steve", "Johnson", 57),
    new User(9, "Mike", "Stevens", 18),
    new User(10, "George", "Armstrong", 24),
    new User(2, "Jim", "Smith", 40),
    new User(8, "Chuck", "Schneider", 34),
    new User(5, "Jorje", "Gonzales", 22),
    new User(6, "Jane", "Michaels", 47),
    new User(7, "Kim", "Berlie", 60)
);

is processed to get their average age like this:

double average = users.parallelStream().mapToInt(u -> u.age).average().getAsDouble();

In this case it is processed on a single host.

My question is: can it be processed using multiple hosts?

E.g. Host1 processes the list below and returns average1 for five users:

new User(1, "Steve", "Vai", 40),
new User(4, "Joe", "Smith", 32),
new User(3, "Steve", "Johnson", 57),
new User(9, "Mike", "Stevens", 18),
new User(10, "George", "Armstrong", 24),

Similarly, Host2 processes the list below and returns average2 for the remaining five users:

new User(2, "Jim", "Smith", 40),
new User(8, "Chuck", "Schneider", 34),
new User(5, "Jorje", "Gonzales", 22),
new User(6, "Jane", "Michaels", 47),
new User(7, "Kim", "Berlie", 60)

Finally, Host3 computes the final result like this:

average = (average1 + average2)  / 2
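Note that averaging the two averages is exact here only because both hosts hold the same number of users. A minimal sketch of a merge that also works for uneven splits, assuming each host returns its sum and count instead of its average (the `Partial` class and the `summarize` helper are hypothetical names, not part of the tutorial or of Java 8):

```java
import java.util.Arrays;
import java.util.List;

public class PartialAverage {
    // Hypothetical value a host would send back: sum and count, not the average itself.
    static final class Partial {
        final long sum;
        final long count;

        Partial(long sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        // Merging two partial results stays correct even when the slices differ in size.
        Partial merge(Partial other) {
            return new Partial(sum + other.sum, count + other.count);
        }

        double average() {
            return (double) sum / count;
        }
    }

    // What a single host would run over its own slice of ages.
    static Partial summarize(List<Integer> ages) {
        long sum = ages.parallelStream().mapToLong(Integer::longValue).sum();
        return new Partial(sum, ages.size());
    }

    public static void main(String[] args) {
        Partial host1 = summarize(Arrays.asList(40, 32, 57, 18, 24)); // ages held by Host1
        Partial host2 = summarize(Arrays.asList(40, 34, 22, 47, 60)); // ages held by Host2
        System.out.println(host1.merge(host2).average());
    }
}
```

Here both slices run in one JVM; how the partial results travel between hosts is left open.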

With a distributed architecture it can be solved with something like remoting. Does Java 8 offer a simpler way to solve the problem, with some abstraction for it?

I know frameworks like Hadoop, Akka and Promises solve it. I am talking about pure Java 8. Can I get any documentation and examples of parallelStream() for multiple hosts?

Here is the list of features scheduled for Java 8 as of September 2013.

As you can see, there is no feature dedicated to standardizing distributed computing over a cluster. The closest you have is JEP 107, which builds on the Fork/Join framework in JDK 7 to leverage multi-core CPUs. In Java 8, you will be able to use lambda expressions to perform bulk operations on collections in parallel by dividing the task among multiple processors.

Java 8 is also scheduled to feature JEP 103, which will also build on Java 7 Fork/Join to sort arrays in parallel. Meanwhile, since Fork/Join is clearly a big deal, it evolves further with JEP 155.
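In the released Java 8 API, parallel array sorting surfaced as `Arrays.parallelSort`. A minimal sketch of what that looks like (the class and the `sortedCopy` helper are hypothetical names used only for illustration); as with parallelStream(), everything happens inside one JVM:

```java
import java.util.Arrays;

public class ParallelSortDemo {
    // Returns a sorted copy so the caller's array is left untouched.
    static int[] sortedCopy(int[] values) {
        int[] copy = Arrays.copyOf(values, values.length);
        // May split the work across Fork/Join worker threads; small arrays
        // like this one are typically just sorted sequentially.
        Arrays.parallelSort(copy);
        return copy;
    }

    public static void main(String[] args) {
        int[] ages = {40, 32, 57, 18, 24, 40, 34, 22, 47, 60};
        System.out.println(Arrays.toString(sortedCopy(ages)));
    }
}
```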

So there are no core Java 8 abstractions for distributed computing over a cluster, only over multiple cores. You will need to devise your own solution for real distributed computing using existing facilities.

As disappointing as that may be, I would point out that there are still wonderful open-source third-party abstractions over Hadoop out there, like Cascalog and Apache Spark. Spark in particular lets you perform operations on your data in a distributed way through the RDD abstraction, which makes it feel like your data is just in a fancy array.

But you will have to wait for such things in core Java.

There is nothing in the documentation or specs suggesting that there will be such a feature. But if you think about it for a moment, RMI is the Java solution for distribution, and it is pretty straightforward: you could use it as the base for distribution and, on the nodes, use the core parallelism as you have shown.
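A rough sketch of that idea, under several assumptions: the remote interface `AgeNode`, its method `sumAndCount` and the class `AgeNodeImpl` are hypothetical names, and for brevity both "nodes" are exported from the same JVM on anonymous ports. In a real deployment each node would export its object in its own JVM and publish the stub, for example through an RMI registry.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;
import java.util.Arrays;
import java.util.List;

public class RmiAverageSketch {
    // Hypothetical remote contract: a node returns {sum, count} for its own slice.
    public interface AgeNode extends Remote {
        long[] sumAndCount() throws RemoteException;
    }

    // Node side: the slice lives on the node and is reduced with a parallel stream.
    static final class AgeNodeImpl implements AgeNode {
        private final List<Integer> ages;

        AgeNodeImpl(List<Integer> ages) {
            this.ages = ages;
        }

        @Override
        public long[] sumAndCount() {
            long sum = ages.parallelStream().mapToLong(Integer::longValue).sum();
            return new long[] {sum, ages.size()};
        }
    }

    // Coordinator side: merge the partial results returned by every node.
    static double average(List<AgeNode> nodes) throws RemoteException {
        long sum = 0;
        long count = 0;
        for (AgeNode node : nodes) {
            long[] part = node.sumAndCount();
            sum += part[0];
            count += part[1];
        }
        return (double) sum / count;
    }

    public static void main(String[] args) throws Exception {
        AgeNodeImpl impl1 = new AgeNodeImpl(Arrays.asList(40, 32, 57, 18, 24));
        AgeNodeImpl impl2 = new AgeNodeImpl(Arrays.asList(40, 34, 22, 47, 60));
        // Port 0 asks RMI for an anonymous port; calls on the stubs go through the RMI transport.
        AgeNode stub1 = (AgeNode) UnicastRemoteObject.exportObject(impl1, 0);
        AgeNode stub2 = (AgeNode) UnicastRemoteObject.exportObject(impl2, 0);
        try {
            System.out.println(average(Arrays.<AgeNode>asList(stub1, stub2)));
        } finally {
            // Unexport so the RMI runtime no longer keeps the JVM alive.
            UnicastRemoteObject.unexportObject(impl1, true);
            UnicastRemoteObject.unexportObject(impl2, true);
        }
    }
}
```

Returning sum and count rather than a per-node average keeps the merge correct when nodes hold slices of different sizes.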

Don't expect such a feature in the core language, as it requires some kind of server to run and manage the different processes. Historically, I don't know of similar solutions that were part of core Java.

There are, however, some solutions that are similar to what you want. One of them is Cascading (http://www.cascading.org/), which is a functional-style infrastructure for writing map reduce programs. That means the actual code is relatively lightweight (unlike traditional map reduce programs), but it does require maintaining a Hadoop infrastructure.

I'm not sure what will happen with Java 8, since it is too early to tell, but there are a couple of open source projects that extend the map reduce capabilities of earlier functional programming languages that run on the JVM to distributed computing environments.

Recently, I took a traditional yet non-trivial Hadoop map reduce job (one that takes raw performance data and prepares it for loading into an OLAP cube) and rewrote it in both Clojure running on Cascalog and Scala running on Spark. I documented my findings in a blog post called Distributed Computing and Functional Programming.

These open source projects are mature and ready for prime time. They are supported by both Cloudera and Hortonworks.
