
Why use the Spark Core API (RDD) when you can do most of it in spark-sql?

I am learning Spark for big data processing. People recommend using HiveContext over SparkSqlContext, and also recommend using DataFrames instead of directly using an RDD.

Spark-sql seems highly optimized thanks to its query planner, so it seems that using spark-sql is a better option than using the Core API (RDD) via Scala (or Python...). Is there something I am missing?

The short answer: right, using spark-sql is recommended for most use cases.

The longer answer:

First, it's not really a question of "Scala vs. spark-sql", it's a question of "Spark Core API (RDDs) vs. spark-sql". The language choice is orthogonal to this debate: there are Scala APIs (as well as Java and Python APIs) for both RDDs and spark-sql, so you would probably use Scala in conjunction with spark-sql, for example:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("example").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.json("people.json")
df.registerTempTable("t1") // registerTempTable returns Unit, so the DataFrame is kept in its own val

sqlContext.sql("SELECT * FROM t1 WHERE ...")

So - yes, it would make sense to write most of the "heavy lifting" using SQL, but there would be some Scala (or Java, or Python) code around it.

Now, as for the "RDD vs. SQL" question - as mentioned above, it is usually recommended to use SQL, because it leaves room for Spark to optimize, unlike RDD operations where the developer instructs Spark exactly what to do and how, passing transformations that are opaque to Spark's engine.
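For illustration, here is a minimal sketch of that difference, assuming the t1 table registered above has a hypothetical age column:

// RDD style: the lambda passed to filter is an opaque JVM function,
// so Spark executes it as-is and cannot push it down or reorder it.
val adultsRdd = df.rdd.filter(row => row.getAs[Long]("age") >= 18)

// SQL / DataFrame style: the predicate is a declarative expression that the
// Catalyst optimizer can analyze, push toward the data source, and reorder.
val adultsDf = sqlContext.sql("SELECT * FROM t1 WHERE age >= 18")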
