
When using Spark purely as an ETL process, which is faster in Spark 2.1: RDD or Dataset?

Hi, I'm using Spark for ETL.

I am just loading JSON strings as an RDD from HDFS, parsing each one as JSON, manipulating each record (with no aggregation or shuffle), and then saving them back to HDFS as JSON strings.

I don't need any query-like operations, so I don't need columnar data.

But many reports say that in Spark 2.1 the Dataset API is faster than RDDs.

I'm confused about which is more suitable for my situation.

Can anyone tell me anything about this?
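For reference, the pipeline described above (read lines, parse each record, modify it, write it back) might look like this with the RDD API. This is only a sketch; the paths, the json4s library choice, and the `"processed"` field are assumptions for illustration, not from the question.

```scala
// Hypothetical RDD-based JSON ETL sketch; paths and the added
// "processed" field are illustrative assumptions.
import org.apache.spark.sql.SparkSession
import org.json4s._
import org.json4s.jackson.JsonMethods._

val spark = SparkSession.builder().appName("json-etl-rdd").getOrCreate()

// One JSON string per line on HDFS.
val lines = spark.sparkContext.textFile("hdfs:///input/events")

val transformed = lines.map { line =>
  val json = parse(line)                                  // parse each record independently
  val updated = json merge JObject("processed" -> JBool(true)) // per-record change, no shuffle
  compact(render(updated))                                // serialize back to a JSON string
}

transformed.saveAsTextFile("hdfs:///output/events")
```

Every record is handled independently in a single `map`, so no shuffle occurs; the cost is that each record pays for generic JSON parsing and Java object allocation.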

In your case a DataFrame is the better option, because it is more straightforward.

From your comment, what you want to do can be expressed simply as

spark.read.json("some json file").select("some json fields").write.json("outputpath")

which is all done via a DataFrame (your JSON data is read as a DataFrame).
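A slightly fuller sketch of the same idea, for a pipeline that also modifies each record: the field names and paths here are assumptions, not from the question.

```scala
// Hypothetical DataFrame version of the ETL; "user", "ts" and the
// paths are illustrative assumptions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("json-etl-df").getOrCreate()

val df = spark.read.json("hdfs:///input/events")  // schema inferred from the JSON

df.select("user", "ts")                 // keep only the fields you need
  .withColumn("processed", lit(true))   // simple per-row manipulation
  .write.json("hdfs:///output/events")
```

Parsing, field selection, and serialization are all handled by Spark's built-in JSON source, so you write no parsing code yourself, and the per-row transformations stay shuffle-free just like the RDD version.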

