
Looking for performance tips on Spark DStream to Parquet files

I want to store Elasticsearch indices as HDFS files without using the ES-Hadoop Connector. A proposed solution is to use Spark Streaming custom receivers to read the documents and save them as Parquet files; the code looks like this:

JavaDStream<String> jsonDocs = ssc.union(dsList.get(0), dsList.subList(1, dsList.size())); // I have a couple of receivers
jsonDocs.foreachRDD(rdd -> {
    Dataset<Row> ds = spark.read().json(spark.createDataset(rdd.rdd(), Encoders.STRING()));
    ds.write().mode(SaveMode.Append).option("compression", "gzip").parquet(path);
});

With this I get acceptable performance numbers, but since I am new to Spark I wonder whether there is room for improvement. For example, I see that the json() and parquet() jobs take most of the time. Is the long-running json() job necessary, or can it be avoided? (I have omitted some other jobs, e.g. count(), from the snippet for simplicity.)

Using Structured Streaming looks like a good option, but I haven't found a simple way to use it with custom receivers. Thanks in advance.

spark.read().json(spark.createDataset(rdd.rdd(), Encoders.STRING()));

Looking at the line above, json() might not be the best choice for performance-sensitive work. Spark uses JacksonParser in its data source API to read JSON, and without an explicit schema it also has to scan the data to infer one. If your JSON structure is simple, try parsing it yourself with a map() function and build the Rows directly.
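
For illustration, here is a minimal sketch of that idea, assuming each document is a flat JSON object with known fields. The field names (id, timestamp, message) and the schema are made up for the example; jsonDocs, spark and path are the same variables as in the question, and mapPartitions() is used instead of map() only so the ObjectMapper can be reused per partition:

import java.util.ArrayList;
import java.util.List;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Explicit schema for the (hypothetical) fields; with a fixed schema Spark
// never has to scan the data to infer one.
StructType schema = new StructType()
        .add("id", DataTypes.StringType)
        .add("timestamp", DataTypes.LongType)
        .add("message", DataTypes.StringType);

jsonDocs.foreachRDD(rdd -> {
    JavaRDD<Row> rows = rdd.mapPartitions(it -> {
        // Reuse one ObjectMapper per partition rather than one per record.
        ObjectMapper mapper = new ObjectMapper();
        List<Row> out = new ArrayList<>();
        while (it.hasNext()) {
            JsonNode doc = mapper.readTree(it.next());
            out.add(RowFactory.create(
                    doc.path("id").asText(null),
                    doc.path("timestamp").asLong(),
                    doc.path("message").asText(null)));
        }
        return out.iterator();
    });
    // Build the DataFrame against the known schema and write Parquet as before.
    spark.createDataFrame(rows, schema)
         .write().mode(SaveMode.Append)
         .option("compression", "gzip")
         .parquet(path);
});

Because the schema is declared up front, Spark skips the inference scan and each record is parsed exactly once. If rewriting the parsing is not an option, passing an explicit schema via spark.read().schema(schema).json(...) also avoids the inference pass, though the JSON still goes through JacksonParser.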
