Send a (parallel) stream through a socket in Java

I'm trying to write a distributed map-reduce program in Java, using the map and reduce operations of the Java 8 Stream API, with the following design:

One client sends data to 3 mappers (each on a different machine, as a standalone Java application). The mappers process the data in parallel by creating a parallelStream() from the List of data.

Each mapper should then call .map(...) on its parallelStream. The idea is then to send the mapped data to another node, the reducer.

The reducer will receive the Stream, call .reduce(...) on it, and finally call .get() to obtain the final result, which is sent back to the client.

My program works if I call .map(...).reduce(...).get() in the same program, but I want to be able to have a separate reducer node.

As I am new to socket programming and to Java 8, I'm having trouble sending the stream through the Socket: it throws a "java.io.NotSerializableException: java.util.stream.ReferencePipeline$3" the moment I try to write the stream with writeObject.

What's the best way to proceed here? Can I turn the stream into something else, send it, and then turn it back into a stream on my reducer node? Is there a better way to send the stream than through an ObjectOutputStream?

Any ideas are very much appreciated. Thank you very much!

PS: The stream is a Stream<Map<String, Integer>>.

One approach is to terminate the stream on the map node with a forEach that pushes each element into the socket. This strategy is superior to collecting first if the collection could be very large (or theoretically infinite): it's space efficient, it's buffered, and downstream nodes are not idle waiting for the collection to complete.
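A minimal sketch of the mapper side, under some assumptions that are not in the original post: the class name MapperSender, the END_OF_STREAM marker value, and the use of an ObjectOutputStream are all illustrative choices. In a real mapper the OutputStream would presumably come from socket.getOutputStream(). Because forEach on a parallel stream runs on several threads and ObjectOutputStream is not thread-safe, the writes are synchronized here.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class MapperSender {
    // Hypothetical end-of-stream marker; any Serializable value that
    // can never be a real data element would do.
    static final String END_OF_STREAM = "__END_OF_STREAM__";

    /** Maps the data in parallel and writes each result to `out` as soon as it is produced. */
    static <T> void mapAndSend(List<T> data,
                               Function<T, Map<String, Integer>> mapper,
                               OutputStream out) {
        try {
            ObjectOutputStream oos = new ObjectOutputStream(out);
            data.parallelStream()
                .map(mapper)
                .forEach(result -> {
                    // Several worker threads call this lambda, so serialize the writes.
                    synchronized (oos) {
                        try {
                            // The Map implementation must be Serializable (e.g. HashMap).
                            oos.writeObject(result);
                        } catch (IOException e) {
                            throw new UncheckedIOException(e);
                        }
                    }
                });
            // Tell the reader that no more elements will follow.
            oos.writeObject(END_OF_STREAM);
            oos.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that ObjectOutputStream remembers every object it has written, so for very long streams it is probably worth calling reset() periodically to keep memory bounded.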

Next, wrap the socket reader on the reduce node in a Spliterator (extend AbstractSpliterator). The tryAdvance method of the Spliterator reads data from the socket and makes it available to the stream through a caller-provided Consumer. tryAdvance returns false when there is no more data (your end-of-stream marker, socket end of stream, or a socket exception). AbstractSpliterator.trySplit implements limited parallelism.
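The reader side might look roughly like the following. This is only a sketch: the class name SocketSpliterator and the END_OF_STREAM marker are hypothetical, and it assumes the sender wrote Map<String, Integer> elements with an ObjectOutputStream followed by that marker. In a real reducer the ObjectInputStream would presumably wrap socket.getInputStream().

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.util.Map;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;

class SocketSpliterator extends Spliterators.AbstractSpliterator<Map<String, Integer>> {
    // Hypothetical marker; it must match whatever the sending side writes last.
    static final String END_OF_STREAM = "__END_OF_STREAM__";

    private final ObjectInputStream in;

    SocketSpliterator(ObjectInputStream in) {
        // The number of elements is unknown up front, hence Long.MAX_VALUE.
        super(Long.MAX_VALUE, Spliterator.NONNULL);
        this.in = in;
    }

    @Override
    @SuppressWarnings("unchecked")
    public boolean tryAdvance(Consumer<? super Map<String, Integer>> action) {
        try {
            Object next = in.readObject();
            if (END_OF_STREAM.equals(next)) {
                return false; // explicit end-of-stream marker
            }
            action.accept((Map<String, Integer>) next);
            return true;
        } catch (IOException | ClassNotFoundException e) {
            // Socket end of stream (EOFException) or a socket error:
            // treated as "no more data", as described above.
            return false;
        }
    }
}
```

Swallowing the exception keeps the sketch short; a real reducer may prefer to log it or rethrow it so that a broken connection is not mistaken for a clean end of data.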

Use StreamSupport.stream(Spliterator spliterator, boolean parallel) to construct a stream from your Spliterator implementation. Your reduce operation pulls its data from this stream.
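As an illustration of this last step, the sketch below builds a stream from an arbitrary Spliterator and reduces it. The class name ReducerNode and the merge function (summing the values per key) are assumptions, since the original post does not say what the reduce operation does. Any Spliterator<Map<String, Integer>> can be passed in, including a socket-backed one.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Spliterator;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class ReducerNode {
    /** Builds a stream over `source` and merges all partial maps into one. */
    static Optional<Map<String, Integer>> reduceAll(Spliterator<Map<String, Integer>> source,
                                                    boolean parallel) {
        Stream<Map<String, Integer>> stream = StreamSupport.stream(source, parallel);
        // reduce(BinaryOperator) returns an Optional, which is empty if no data arrived.
        return stream.reduce(ReducerNode::merge);
    }

    // Hypothetical reduce function: sums the values per key.
    // It copies instead of mutating its inputs, so it stays safe in parallel streams.
    static Map<String, Integer> merge(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> merged = new HashMap<>(a);
        b.forEach((key, value) -> merged.merge(key, value, Integer::sum));
        return merged;
    }
}
```

For a quick local check without a socket, a list's spliterator can stand in for the real source, for example ReducerNode.reduceAll(listOfMaps.spliterator(), false).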

You could also keep the socket open, in which case your end-of-stream marker becomes more like an end-of-message marker (it reminds me of a batching pig in a pipeline).
