
Why use Hive on Spark instead of Spark-SQL?

I'm new to the data science field and I don't understand why someone would want to connect Hive to Spark instead of just using Spark-SQL.

What are the benefits of using Hive on Spark rather than Spark-SQL (other than being able to reuse Hive code already in production)?

Thanks

The answer above is not correct. The one component that Hive and SparkSQL have in common is SemanticAnalyzer. Hive has significantly better SQL support and a more sophisticated cost-based optimizer. My recommendation is to use Hive on Tez, as opposed to Hive on Spark or SparkSQL, because it is production ready, more stable, and more scalable.

Hmm, it seems the only answer here advises using Tez...

Back to the original question: IMHO, the benefit of Hive on Spark is mainly better support for Hive features, not for the HiveQL language itself. In particular, Hive on Spark has much better support for HiveServer2 and its security features.

In SparkSQL these features are really buggy. SparkSQL ships its own HiveServer2 implementation, but in the latest release line (1.6.x) it no longer works with the hivevar and hiveconf arguments, and the username supplied when logging in via JDBC does not work either. See https://issues.apache.org/jira/browse/SPARK-13983
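For context on what the hiveconf and hivevar arguments look like from a JDBC client, here is a minimal sketch that builds a HiveServer2 connection URL. It assumes the documented layout `jdbc:hive2://host:port/db;session_vars?hive_conf#hive_vars`. The helper name `hs2_jdbc_url`, the host name, and the variable names are hypothetical and only for illustration.

```python
def hs2_jdbc_url(host, port=10000, database="default",
                 session_vars=None, hive_conf=None, hive_vars=None):
    """Build a HiveServer2 JDBC URL (hypothetical helper, not part of any library).

    Assumed layout: jdbc:hive2://host:port/db;session_vars?hive_conf#hive_vars
    Each section is a ';'-separated list of key=value pairs.
    """
    def join(pairs):
        # Dicts keep insertion order, so the output order is predictable.
        return ";".join(f"{key}={value}" for key, value in pairs.items())

    url = f"jdbc:hive2://{host}:{port}/{database}"
    if session_vars:
        url += ";" + join(session_vars)   # session variables follow the database
    if hive_conf:
        url += "?" + join(hive_conf)      # hiveconf settings follow '?'
    if hive_vars:
        url += "#" + join(hive_vars)      # hivevar substitutions follow '#'
    return url


if __name__ == "__main__":
    # Host and variable names below are made up for the example.
    print(hs2_jdbc_url(
        "hs2.example.com",
        hive_conf={"hive.execution.engine": "spark"},
        hive_vars={"run_date": "2016-03-01"},
    ))
```

A query sent over such a connection could then reference `${hivevar:run_date}`. Whether the server honors these sections is exactly what the linked issue is about.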

Our requirement is to use Spark with HiveServer2 in a secure way (with authentication and authorization), which SparkSQL alone currently cannot provide. We also do not need other Hadoop components such as HDFS or YARN, since we run Spark in standalone mode. So for our requirement we use Ranger/Sentry + Hive on Spark.
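As a rough sketch of the engine side of that setup, the snippet below renders the Hive session settings that point Hive at a standalone Spark master. `hive.execution.engine` and `spark.master` are the properties Hive on Spark uses for this. The helper name, the master host, and the memory value are assumptions for illustration, and the Ranger/Sentry authorization configuration is not covered here.

```python
def hive_on_spark_settings(master_url, executor_memory=None):
    """Render Hive SET statements for running Hive on a Spark master.

    Hypothetical helper for illustration; master_url would be something like
    spark://host:7077 for a standalone cluster.
    """
    settings = {
        "hive.execution.engine": "spark",  # switch Hive's execution engine to Spark
        "spark.master": master_url,        # where Hive submits its Spark jobs
    }
    if executor_memory is not None:
        settings["spark.executor.memory"] = executor_memory
    return [f"SET {key}={value};" for key, value in settings.items()]


if __name__ == "__main__":
    # The master host below is made up for the example.
    for line in hive_on_spark_settings("spark://spark-master.example.com:7077", "2g"):
        print(line)
```

The resulting statements can be run at the start of a Hive session, or the same properties can be placed in hive-site.xml.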

