
Looking for performance tips on Spark DStream to Parquet files

I want to store Elasticsearch indices as HDFS files without using the ES-Hadoop Connector. A proposed solution is to use Spark Streaming custom receivers to read the documents and save them as Parquet files; the code looks like this:

JavaDStream<String> jsonDocs = ssc.union(dsList.get(0), dsList.subList(1, dsList.size())); // I have a couple of receivers
jsonDocs.foreachRDD(rdd -> {
    Dataset<Row> ds = spark.read().json(spark.createDataset(rdd.rdd(), Encoders.STRING()));
    ds.write().mode(SaveMode.Append).option("compression", "gzip").parquet(path);
});

With this I get acceptable performance numbers, but since I am new to Spark I wonder whether there is room for improvement. For example, I see that the json() and parquet() jobs take most of the time. Is the long-running json() job necessary, or can it be avoided? (I have omitted some other jobs, e.g. count(), from the snippet for simplicity.)

Using Structured Streaming looks like a good option, but I haven't found a simple way to use it with custom receivers. Thanks in advance.

spark.read().json(spark.createDataset(rdd.rdd(), Encoders.STRING()));

Looking at the line above, json() might not be the best choice for performance-sensitive work. Spark uses JacksonParser in its data source API to read JSON, and without an explicit schema it also has to scan the data to infer one. If your JSON structure is simple, try parsing it yourself with a map() function and build the Rows directly.
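
For illustration, here is a minimal sketch of that idea, assuming each document is a flat JSON object with known fields. The field names (id, timestamp, message) and the schema are made up for the example; jsonDocs, spark and path are the same variables as in the question, and mapPartitions() is used instead of map() only so the ObjectMapper can be reused per partition:

import java.util.ArrayList;
import java.util.List;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Explicit schema for the (hypothetical) fields; with a fixed schema Spark
// never has to scan the data to infer one.
StructType schema = new StructType()
        .add("id", DataTypes.StringType)
        .add("timestamp", DataTypes.LongType)
        .add("message", DataTypes.StringType);

jsonDocs.foreachRDD(rdd -> {
    JavaRDD<Row> rows = rdd.mapPartitions(it -> {
        // Reuse one ObjectMapper per partition rather than one per record.
        ObjectMapper mapper = new ObjectMapper();
        List<Row> out = new ArrayList<>();
        while (it.hasNext()) {
            JsonNode doc = mapper.readTree(it.next());
            out.add(RowFactory.create(
                    doc.path("id").asText(null),
                    doc.path("timestamp").asLong(),
                    doc.path("message").asText(null)));
        }
        return out.iterator();
    });
    // Build the DataFrame against the known schema and write Parquet as before.
    spark.createDataFrame(rows, schema)
         .write().mode(SaveMode.Append)
         .option("compression", "gzip")
         .parquet(path);
});

Because the schema is declared up front, Spark skips the inference scan and each record is parsed exactly once. If rewriting the parsing is not an option, passing an explicit schema via spark.read().schema(schema).json(...) also avoids the inference pass, though the JSON still goes through JacksonParser.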
