What's the difference between Spark SQL native syntax and Hive QL syntax in Spark?
In the Spark official documentation, two types of SQL syntax are mentioned: Spark native SQL syntax and Hive QL syntax. I couldn't find a detailed explanation of their differences, and I'm confused by the following questions:
HiveQL is a mixture of SQL-92, MySQL, and Oracle's SQL dialect. It also provides features from later SQL standards, such as window functions. Additionally, HiveQL extends some features that don't belong to the SQL standards; they are inspired by MapReduce, e.g., multi-table inserts.
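For instance, a multi-table insert in HiveQL scans the source table once and feeds several targets; this is a sketch with hypothetical table and column names:

```sql
-- Hypothetical tables: one scan of staged_events feeds two inserts
FROM staged_events se
INSERT INTO TABLE clicks SELECT se.user_id, se.url  WHERE se.event_type = 'click'
INSERT INTO TABLE views  SELECT se.user_id, se.page WHERE se.event_type = 'view';
```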
Briefly speaking, you can analyze data with the Java-based power of MapReduce via the SQL-like HiveQL, since Apache Hive is a kind of data warehouse on top of Hadoop.
With Spark SQL, you can read and write data in a variety of structured formats, one of which is Hive tables. Spark SQL supports ANSI SQL:2003-compliant commands as well as HiveQL. In a nutshell, you can manipulate data with the power of the Spark engine via the SQL-like Spark SQL, and Spark SQL covers the majority of HiveQL's features.
When working with Hive, you must instantiate SparkSession with Hive support, including connectivity to a persistent Hive metastore, support for Hive SerDes, and Hive user-defined functions. Users who do not have an existing Hive deployment can still enable Hive support; Spark deals with the storage for you.
val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation) // where managed tables are stored
  .enableHiveSupport() // enables metastore connectivity, Hive SerDes, and Hive UDFs
  .getOrCreate()
rlike is not covered in HiveQL.

/* Example of Utilizing HiveQL via Spark SQL */
CREATE TABLE person (id INT, name STRING, age INT, class INT, address STRING);
INSERT INTO person VALUES
(100, 'John', 30, 1, 'Street 1'),
(200, 'Mary', NULL, 1, 'Street 2'),
(300, 'Mike', 80, 3, 'Street 3'),
(400, 'Dan', 50, 4, 'Street 4');
/* Utilize a feature from HiveQL */
SELECT * FROM person
LATERAL VIEW EXPLODE(ARRAY(30, 60)) tableName AS c_age
LATERAL VIEW EXPLODE(ARRAY(40, 80)) AS d_age;
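As a sketch of the RLIKE predicate mentioned above, Spark SQL applies a Java regular expression to the column; this query runs against the person table created earlier:

```sql
/* Match names beginning with 'M' using a Java regex */
SELECT name FROM person WHERE name RLIKE '^M'; -- matches 'Mary' and 'Mike' in the sample data
```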
The Spark documentation lists the known incompatibilities.
I've also found some incompatibilities due to bugs in the Spark parser; it looks like Hive is more robust.
You may also find differences in Spark's serialization/deserialization implementation, as explained in this answer. Basically, you must adjust these properties:
spark.sql.hive.convertMetastoreOrc=false
spark.sql.hive.convertMetastoreParquet=false
but beware that this will have a performance penalty.
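A minimal sketch of setting these properties when building the session (Scala, assuming the same SparkSession builder style as above; with these set, Spark reads Hive ORC/Parquet tables through Hive SerDes instead of its built-in readers):

```scala
val spark = SparkSession
  .builder()
  .appName("Hive SerDe Compatibility")
  // Use Hive SerDes for ORC/Parquet metastore tables; slower than
  // Spark's native readers, but matches Hive's serialization behavior.
  .config("spark.sql.hive.convertMetastoreOrc", "false")
  .config("spark.sql.hive.convertMetastoreParquet", "false")
  .enableHiveSupport()
  .getOrCreate()
```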