
A single output request from multiple elements of a JavaRDD in Apache Spark Streaming

Summary

My question is about how an Apache Spark Streaming application can handle an output operation that takes a long time, either by improving parallelization or by combining many small writes into a single, larger write. In this case, the write is a Cypher query sent to Neo4J, but the question could apply to other data stores.


Environment

I have an Apache Spark Streaming application in Java that writes to 2 datastores: Elasticsearch and Neo4j. Here are the versions:

  • Java 8
  • Apache Spark 2.11
  • Neo4J 3.1.1
  • Neo4J Java Bolt Driver 1.1.2

The Elasticsearch output was easy enough, as I used the Elasticsearch-Hadoop library for Apache Spark.
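For reference, that part looks roughly like the following minimal sketch, assuming a JavaDStream of Map-style documents called esDocStream (that name and the "our-index/message" resource are placeholders, not our actual code):

// elasticsearch-hadoop ships a Java-friendly helper for writing an RDD to Elasticsearch.
// import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;
esDocStream.foreachRDD(rdd -> {
    // One bulk write per micro-batch; "our-index/message" is a hypothetical resource name.
    JavaEsSpark.saveToEs(rdd, "our-index/message");
});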


Our Stream

Our input is a stream from Kafka received on a particular topic, and I deserialize the elements of the stream through a map function to create a JavaDStream<OurMessage> dataStream. I then transform each message into a Cypher query String cypherRequest (using an OurMessage-to-String transformation) that is sent to a singleton managing the Bolt Driver connection to Neo4j (I know I should use a connection pool, but maybe that's another question). The Cypher query produces a number of nodes and/or edges based on the contents of the OurMessage.

The code looks something like the following.

// dataStream here has already been mapped from OurMessage elements to Cypher query Strings.
dataStream.foreachRDD( rdd -> {
    rdd.foreach( cypherQuery -> {
        // One round trip to Neo4j per element of every micro-batch.
        BoltDriverSingleton.getInstance().update(cypherQuery);
    });
});
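For completeness, a minimal sketch of what such a singleton might look like, using the Bolt Driver 1.1.x API (the URI, credentials, and class internals here are illustrative, not our actual code):

import org.neo4j.driver.v1.AuthTokens;
import org.neo4j.driver.v1.Driver;
import org.neo4j.driver.v1.GraphDatabase;
import org.neo4j.driver.v1.Session;

// Hypothetical sketch of the singleton that owns the Bolt connection.
public final class BoltDriverSingleton {
    private static final BoltDriverSingleton INSTANCE = new BoltDriverSingleton();

    // A single Driver instance is thread-safe and intended to be shared;
    // the URI and credentials below are placeholders.
    private final Driver driver =
            GraphDatabase.driver("bolt://localhost:7687", AuthTokens.basic("neo4j", "password"));

    private BoltDriverSingleton() {}

    public static BoltDriverSingleton getInstance() {
        return INSTANCE;
    }

    public void update(String cypherQuery) {
        // Sessions are cheap to open; one per write, closed by try-with-resources.
        try (Session session = driver.session()) {
            session.run(cypherQuery);
        }
    }
}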


Possibilities for Optimization

I have two thoughts about how to improve throughput:

  1. I am not sure whether Spark Streaming parallelization goes down to the level of individual RDD elements. That is, the output of the RDDs themselves can be parallelized (within `stream.foreachRDD()`), but can the elements of each RDD also be processed in parallel (within `rdd.foreach()`)? If the latter were the case, would a `reduce` transformation on our `dataStream` increase Spark's ability to output this data in parallel (each JavaRDD would then contain exactly one Cypher query)? See the per-partition sketch after this list.
  2. Even with improved parallelization, our throughput would increase further if I could implement some sort of builder that takes every element of the RDD and creates a single Cypher query adding the nodes/edges from all of those elements, instead of issuing one Cypher query per element. But how would I be able to do this without using another Kafka instance, which may be overkill?
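On the first point, here is a minimal sketch of the per-partition pattern, assuming the stream has already been mapped to Cypher query strings (cypherStream) and assuming a hypothetical batch method updateAll on the singleton:

// Spark already runs one task per partition on the executors, so elements within an RDD
// are processed in parallel across partitions. foreachPartition exposes that unit of work,
// which lets a task group its writes instead of issuing one per element.
cypherStream.foreachRDD(rdd -> {
    rdd.foreachPartition(partition -> {
        List<String> queries = new ArrayList<>();
        partition.forEachRemaining(queries::add);
        if (!queries.isEmpty()) {
            // updateAll is hypothetical: e.g. run all queries in one session/transaction,
            // giving one round trip per partition rather than one per element.
            BoltDriverSingleton.getInstance().updateAll(queries);
        }
    });
});

Whether this actually helps would depend on how many partitions each batch has and on how much of the cost is per-connection versus per-statement.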

Am I overthinking this? I've tried to research this so much that I might be in too deep.


Aside: I apologize in advance if any of this is completely wrong. You don't know what you don't know, and I've only just started working with Apache Spark and Java 8 with lambdas. As Spark users must know by now, either Spark has a steep learning curve due to its very different paradigm, or I'm an idiot :).

Thanks to anyone who might be able to help; this is my first StackOverflow question in a long time, so please leave feedback and I will be responsive and correct this question as needed.

Update

I think all we need is a simple map/reduce. The following should allow us to parse each message in the RDD and then write them all to the graph DB at once.

dataStream.map( message -> {
    // Parse each message into an intermediate result on the executors.
    return (ParseResult) Neo4JMessageParser.parse(message);
}).foreachRDD( rdd -> {
    // Bring this micro-batch's parse results back to the driver, build one combined
    // Cypher query from them, and issue a single write to the graph.
    List<ParseResult> parseResults = rdd.collect();
    String cypherQuery = Neo4JMessageParser.buildQuery(parseResults);
    Neo4JRepository.update(cypherQuery);
    // commit offsets
});

By doing this, we should be able to reduce the overhead associated with having to do a write for each incoming message.
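Neo4JMessageParser.buildQuery is the piece I still have to write. A rough sketch of the idea, assuming each ParseResult can render itself as a Cypher fragment via a hypothetical toCypherFragment() method (each fragment would need its own variable names so the combined statement stays valid):

// Hypothetical builder: fold every parse result of the micro-batch into one Cypher
// statement, so the graph gets a single round trip per batch instead of one per message.
public static String buildQuery(List<ParseResult> parseResults) {
    StringBuilder query = new StringBuilder();
    for (ParseResult result : parseResults) {
        // e.g. "MERGE (m42:Message {id: '...'})" for the nodes/edges of one message.
        query.append(result.toCypherFragment()).append("\n");
    }
    return query.toString();
}

A more robust variant would pass the per-message data as a query parameter and use a single UNWIND ... MERGE statement, which avoids building Cypher through string concatenation.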
