A single output request from multiple elements of a JavaRDD in Apache Spark Streaming

Summary

My question is about how Apache Spark Streaming can handle an output operation that takes a long time, either by improving parallelization or by combining many writes into a single, larger write. In this case, the write is a cypher request to Neo4J, but the question could apply to other data stores.


Environment

I have an Apache Spark Streaming application in Java that writes to 2 datastores: Elasticsearch and Neo4j. Here are the versions:

  • Java 8
  • Apache Spark 2.11
  • Neo4J 3.1.1
  • Neo4J Java Bolt Driver 1.1.2

The Elasticsearch output was easy enough, as I used the Elasticsearch-Hadoop for Apache Spark library.


Our Stream

Our input is a stream from Kafka received on a particular topic, and I deserialize the elements of the stream through a map function to create a `JavaDStream<OurMessage> dataStream`. I then do transforms on this message to create a cypher query `String cypherRequest` (using an OurMessage to String transformation) that is sent to a singleton that manages the Bolt Driver connection to Neo4j (I know I should use a connection pool, but maybe that's another question). The cypher query produces a number of nodes and/or edges based on the contents of OurMessage.

The code looks something like the following.

dataStream.foreachRDD( rdd -> {
    rdd.foreach( cypherQuery -> {
        BoltDriverSingleton.getInstance().update(cypherQuery);
    });
});


Possibilities for Optimization

I have two thoughts about how to improve throughput:

  1. I am not sure whether Spark Streaming parallelization goes down to the level of individual RDD elements. Meaning, the output of RDDs can be parallelized (within `stream.foreachRDD()`), but can each element of the RDD be parallelized (within `rdd.foreach()`)? If the latter were the case, would a `reduce` transformation on our `dataStream` increase Spark's ability to output this data in parallel (each JavaRDD would contain exactly one cypher query)?
  2. Even with improved parallelization, our performance would further increase if I could implement some sort of Builder that takes each element of the RDD and creates a single cypher query that adds the nodes/edges from all elements, instead of one cypher query per element. But how would I be able to do this without using another kafka instance, which may be overkill? (A per-partition sketch of this idea follows this list.)
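As a rough sketch of that per-partition idea, assuming (as in the snippet above) that the RDD elements are already cypher query Strings, and assuming a hypothetical `updateAll` batch method on the same `BoltDriverSingleton` (plus `java.util.ArrayList`/`java.util.List` imports), something like the following would pay the write overhead once per partition instead of once per element, while still writing partitions in parallel across executors:

dataStream.foreachRDD( rdd -> {
    // foreachPartition runs once per partition, so partitions are still written
    // in parallel across executors, but the per-request overhead is paid once
    // per partition rather than once per element.
    rdd.foreachPartition( cypherQueries -> {
        List<String> batch = new ArrayList<>();
        cypherQueries.forEachRemaining(batch::add);
        if (!batch.isEmpty()) {
            // updateAll is a hypothetical batch variant of update() that sends
            // all statements in a single round trip / transaction.
            BoltDriverSingleton.getInstance().updateAll(batch);
        }
    });
});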

Am I overthinking this? I've tried to research so much that I might be in too deep.


Aside: I apologize in advance if any of this is completely wrong. You don't know what you don't know, and I've just started working with Apache Spark and Java 8 with lambdas. As Spark users must know by now, either Spark has a steep learning curve due to its very different paradigm, or I'm an idiot :).

Thanks to anyone who might be able to help; this is my first StackOverflow question in a long time, so please leave feedback and I will be responsive and correct this question as needed.

I think all we need is a simple Map/Reduce. The following should allow us to parse each message in the RDD and then write it to the Graph DB all at once.

dataStream.map( message -> {
    // Parse each incoming message into an intermediate result on the executors.
    return (ParseResult) Neo4JMessageParser.parse(message);
}).foreachRDD( rdd -> {
    // Collect this batch's results on the driver and write them with a single request.
    List<ParseResult> parseResults = rdd.collect();
    String cypherQuery = Neo4JMessageParser.buildQuery(parseResults);
    Neo4JRepository.update(cypherQuery);
    // commit offsets
});

By doing this, we should be able to reduce the overhead associated with having to do a write for each incoming message.
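For illustration only, here is a minimal sketch of what `Neo4JMessageParser` could look like under this approach. The `ParseResult` shape, the `getId()` accessor on OurMessage, and the clause format are assumptions; in practice a parameterized query (for example an `UNWIND` over a parameter list) would be preferable to building the statement by string concatenation:

import java.io.Serializable;
import java.util.List;
import java.util.stream.Collectors;

public final class Neo4JMessageParser {

    // Hypothetical intermediate result: the cypher clause produced from one message.
    // Serializable so Spark can ship results back to the driver via collect().
    public static class ParseResult implements Serializable {
        private final String cypher;
        public ParseResult(String cypher) { this.cypher = cypher; }
        public String toCypher() { return cypher; }
    }

    // Assumed mapping from one OurMessage to a MERGE clause; anonymous node
    // patterns avoid variable-name collisions when clauses are concatenated.
    public static ParseResult parse(OurMessage message) {
        return new ParseResult("MERGE (:Node {id: '" + message.getId() + "'})");
    }

    // Combine the per-message clauses into a single statement so the whole
    // micro-batch is written to Neo4j with one request.
    public static String buildQuery(List<ParseResult> parseResults) {
        return parseResults.stream()
                .map(ParseResult::toCypher)
                .collect(Collectors.joining("\n"));
    }
}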
