
When using Spark purely as an ETL process, which is faster in Spark 2.1: RDD or Dataset?

Hi, I'm using Spark for ETL.

I am just loading JSON strings as an RDD from HDFS, parsing each one as JSON, manipulating each record (with no aggregation or shuffle), and then saving them back to HDFS as JSON strings.

I don't need any query-like operations, so I don't need columnar data.

But many reports say that in Spark 2.1 the Dataset API is faster than RDDs.

I'm confused about which is more suitable for my situation.

Can anyone tell me anything about this?
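For reference, the pipeline described above (read lines, parse each record, modify it, write it back) might look like this with the RDD API. This is only a sketch; the paths, the json4s library choice, and the `"processed"` field are assumptions for illustration, not from the question.

```scala
// Hypothetical RDD-based JSON ETL sketch; paths and the added
// "processed" field are illustrative assumptions.
import org.apache.spark.sql.SparkSession
import org.json4s._
import org.json4s.jackson.JsonMethods._

val spark = SparkSession.builder().appName("json-etl-rdd").getOrCreate()

// One JSON string per line on HDFS.
val lines = spark.sparkContext.textFile("hdfs:///input/events")

val transformed = lines.map { line =>
  val json = parse(line)                                  // parse each record independently
  val updated = json merge JObject("processed" -> JBool(true)) // per-record change, no shuffle
  compact(render(updated))                                // serialize back to a JSON string
}

transformed.saveAsTextFile("hdfs:///output/events")
```

Every record is handled independently in a single `map`, so no shuffle occurs; the cost is that each record pays for generic JSON parsing and Java object allocation.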

In your case a DataFrame is the better option, because it is more straightforward.

From your comment, what you want to do can be expressed simply as

spark.read.json("some json file").select("some json fields").write.json("outputpath")

which is all done via a DataFrame (your JSON data is read as a DataFrame).
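A slightly fuller sketch of the same idea, for a pipeline that also modifies each record: the field names and paths here are assumptions, not from the question.

```scala
// Hypothetical DataFrame version of the ETL; "user", "ts" and the
// paths are illustrative assumptions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("json-etl-df").getOrCreate()

val df = spark.read.json("hdfs:///input/events")  // schema inferred from the JSON

df.select("user", "ts")                 // keep only the fields you need
  .withColumn("processed", lit(true))   // simple per-row manipulation
  .write.json("hdfs:///output/events")
```

Parsing, field selection, and serialization are all handled by Spark's built-in JSON source, so you write no parsing code yourself, and the per-row transformations stay shuffle-free just like the RDD version.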

