A single output request from multiple elements of a JavaRDD in Apache Spark Streaming
My question is about how Apache Spark Streaming can handle an output operation that takes a long time, either by improving parallelization or by combining many writes into a single, larger write. In this case, the write is a Cypher request to Neo4j, but the question could apply to other data stores.
I have an Apache Spark Streaming application in Java that writes to two datastores: Elasticsearch and Neo4j. Here are the versions:
The Elasticsearch output was easy enough, as I used the Elasticsearch-Hadoop for Apache Spark library.
Our input is a stream from Kafka received on a particular topic, and I deserialize the elements of the stream through a map function to create a JavaDStream<[OurMessage]> dataStream. I then transform each message into a Cypher query, String cypherRequest (using an OurMessage-to-String transformation), which is sent to a singleton that manages the Bolt Driver connection to Neo4j (I know I should use a connection pool, but maybe that's another question). The Cypher query creates a number of nodes and/or edges based on the contents of OurMessage.
The code looks something like the following.
    dataStream.foreachRDD(rdd -> {
        rdd.foreach(cypherQuery -> {
            BoltDriverSingleton.getInstance().update(cypherQuery);
        });
    });
I have two thoughts about how to improve throughput:
Am I overthinking this? I've researched this so much that I might be in too deep.
Thanks to anyone who might be able to help. This is my first StackOverflow question in a long time, so please leave feedback and I will respond and correct this question as needed.
I think all we need is a simple Map/Reduce. The following should allow us to parse each message in the RDD and then write the results to the graph database all at once.
    dataStream.map(message -> {
        return (ParseResult) Neo4JMessageParser.parse(message);
    }).foreachRDD(rdd -> {
        List<ParseResult> parseResults = rdd.collect();
        String cypherQuery = Neo4JMessageParser.buildQuery(parseResults);
        Neo4JRepository.update(cypherQuery);
        // commit offsets
    });
By doing this, we should be able to reduce the overhead of issuing a separate write for each incoming message.
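One caveat: rdd.collect() pulls the whole micro-batch onto the driver, so a single query can grow without bound. A minimal sketch of one way to cap it, assuming each parsed message can be flattened into a Map of properties: drain an Iterator into fixed-size batches and pass each batch as the $rows parameter of one UNWIND statement. CypherBatcher, the Message label, the id property and the query text are all hypothetical, not taken from the code above, and the $rows parameter syntax assumes a reasonably recent Neo4j version. Since the helper only needs an Iterator, it should work equally on parseResults.iterator() or on the iterator that rdd.foreachPartition hands to each executor.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: groups rows into fixed-size parameter maps for an UNWIND query. */
final class CypherBatcher {

    // Assumed query shape; the label and property names are placeholders.
    static final String UNWIND_QUERY =
        "UNWIND $rows AS row MERGE (n:Message {id: row.id}) SET n += row";

    private CypherBatcher() {}

    /** Drains the iterator into parameter maps holding at most batchSize rows each. */
    static List<Map<String, Object>> toParameterBatches(
            Iterator<Map<String, Object>> rows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<Map<String, Object>> batches = new ArrayList<>();
        List<Map<String, Object>> current = new ArrayList<>(batchSize);
        while (rows.hasNext()) {
            current.add(rows.next());
            if (current.size() == batchSize) {
                batches.add(asParameters(current));
                current = new ArrayList<>(batchSize); // start a fresh batch
            }
        }
        if (!current.isEmpty()) {
            batches.add(asParameters(current)); // trailing partial batch
        }
        return batches;
    }

    /** Wraps one batch as the parameter map {"rows": [...]} expected by UNWIND_QUERY. */
    private static Map<String, Object> asParameters(List<Map<String, Object>> rows) {
        Map<String, Object> params = new HashMap<>();
        params.put("rows", rows);
        return params;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            Map<String, Object> row = new HashMap<>();
            row.put("id", i);
            rows.add(row);
        }
        for (Map<String, Object> params : toParameterBatches(rows.iterator(), 2)) {
            // In the real job, this is where the driver call would go,
            // e.g. something like session.run(UNWIND_QUERY, params).
            System.out.println(UNWIND_QUERY + " <- " + params);
        }
    }
}
```

Passing the rows as a parameter rather than concatenating them into the query string also keeps the query text constant, which should let Neo4j reuse the query plan. The right batch size depends on the message size and the cluster, so treat the value as something to measure rather than a recommendation.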