
What's the difference between Spark SQL native syntax and Hive QL syntax in Spark?

The official Spark documentation mentions two kinds of SQL syntax: Spark native SQL syntax and Hive QL syntax. I couldn't find a detailed explanation of how they differ, and I'm confused about the following questions:

  1. Is Spark native SQL syntax a subset of Hive QL? I ask because some articles say so. And according to the official Spark page https://spark.apache.org/docs/3.0.0-preview2/sql-migration-guide.html#compatibility-with-apache-hive , Spark SQL does not support all features of Hive QL.
  2. If the answer to question 1 is yes, why can I run "join A rlike B" in Spark SQL but not in Hive?
  3. How does Spark decide whether to treat a SQL statement as Spark native SQL or as Hive QL?
  4. When we call enableHiveSupport while initializing the SparkSession, does it mean Spark will treat every SQL statement as Hive QL?

Prologue

HiveQL is a mixture of SQL-92, MySQL, and Oracle's SQL dialect. It also provides features from later SQL standards, such as window functions. Additionally, HiveQL adds some extensions that are not part of any SQL standard; they are inspired by MapReduce, e.g., multitable inserts.
In short, because Apache Hive is a data warehouse built on top of Hadoop, the SQL-like HiveQL lets you analyze data with the Java-based power of MapReduce.

With Spark SQL, you can read and write data in a variety of structured formats, one of which is Hive tables. Spark SQL supports ANSI SQL:2003-compliant commands and HiveQL. In a nutshell, you can manipulate data with the power of the Spark engine via the SQL-like Spark SQL, and Spark SQL covers the majority of HiveQL's features.

When working with Hive, you must instantiate SparkSession with Hive support, which includes connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions.
Users who do not have an existing Hive deployment can still enable Hive support. Spark deals with the storage for you.

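import java.io.File
import org.apache.spark.sql.SparkSession

// Default location for managed databases and tables (as in the Spark "Hive Tables" docs)
val warehouseLocation = new File("spark-warehouse").getAbsolutePath

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()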
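
One way to see what enabling Hive support changes is to read back the catalog setting from the session. The following is only a minimal sketch: it assumes the spark-sql and spark-hive artifacts are on the classpath and a local master, and the object and method names (`CatalogCheck`, `catalogImplementation`) are hypothetical, chosen for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper for illustration; not part of the original answer.
object CatalogCheck {
  // Reads which catalog implementation the session was built with.
  def catalogImplementation(spark: SparkSession): String =
    spark.conf.get("spark.sql.catalogImplementation")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Catalog check")
      .master("local[*]")     // assumption: run locally
      .enableHiveSupport()    // needs Hive classes on the classpath
      .getOrCreate()

    println(catalogImplementation(spark))
    spark.stop()
  }
}
```

With Hive support enabled the setting is expected to read `hive`, and `in-memory` otherwise. In other words, the call selects a metastore-backed catalog for the session rather than a different SQL dialect.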

Answers

  1. I would say that they overlap heavily. Spark SQL is almost a superset of HiveQL.
  2. Spark SQL is not a subset of HiveQL. As for the latter part, it's because regular-expression-like predicates were introduced in the SQL:2003 standard. Spark SQL is SQL:2003 compliant, whereas HiveQL implements only a few of the features introduced in SQL:2003, and rlike is not among them.
  3. You would have to look at the Spark source code. Practically speaking, in my opinion, you only need to keep in mind that Spark SQL helps you read and write data from a variety of data sources and that it covers HiveQL. Spark SQL has most of the capabilities of HiveQL.
  4. Not exactly. Spark SQL is Spark SQL. Enabling the functionality you mentioned usually means you are going to communicate with Apache Hive. Even if you don't have an Apache Hive installation, with that functionality enabled you can use some features of HiveQL via Spark SQL, since Spark SQL supports the majority of HiveQL's features and Spark has an internal mechanism to handle the data warehouse storage.
/* Example of Utilizing HiveQL via Spark SQL */
CREATE TABLE person (id INT, name STRING, age INT, class INT, address STRING);
INSERT INTO person VALUES
    (100, 'John', 30, 1, 'Street 1'),
    (200, 'Mary', NULL, 1, 'Street 2'),
    (300, 'Mike', 80, 3, 'Street 3'),
    (400, 'Dan', 50, 4, 'Street 4');

/* Utilize a feature from HiveQL */
SELECT * FROM person
    LATERAL VIEW EXPLODE(ARRAY(30, 60)) tableName AS c_age
    LATERAL VIEW EXPLODE(ARRAY(40, 80)) AS d_age;
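
For the "join A rlike B" case from question 2, a small Scala sketch can be used to try a join whose condition is an RLIKE predicate in Spark SQL. It assumes a local Spark session without Hive support; the view names (`addresses`, `patterns`), the sample rows, and the object name are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical example; names and data are illustrative only.
object RlikeJoinSketch {
  // Joins each address to every pattern it matches as a regular expression.
  def rlikeJoin(spark: SparkSession): DataFrame = {
    import spark.implicits._

    Seq("Street 1", "Avenue 2").toDF("address")
      .createOrReplaceTempView("addresses")
    Seq("^Street", "^Road").toDF("pattern")
      .createOrReplaceTempView("patterns")

    spark.sql(
      """SELECT a.address, p.pattern
        |FROM addresses a
        |JOIN patterns p ON a.address RLIKE p.pattern""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rlike join sketch")
      .master("local[*]")   // assumption: run locally, no Hive support
      .getOrCreate()

    rlikeJoin(spark).show()
    spark.stop()
  }
}
```

Because the join condition is not an equality, Spark has to compare every address with every pattern, so this form is best kept to small inputs on at least one side.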

References

  1. Damji, J., Wenig, B., Das, T. and Lee, D., 2020. Learning Spark: Lightning-Fast Data Analytics. 2nd ed. Sebastopol, CA: O'Reilly, pp. 83-112.
  2. ISO/IEC JTC 1/SC 32 Data management and interchange, 1992. Information technology - Database languages - SQL, ISO/IEC 9075:1992, the USA.
  3. ISO/IEC JTC 1/SC 32 Data management and interchange, 2003. Information technology - Database languages - SQL - Part 2: Foundation (SQL/Foundation), ISO/IEC 9075-2:2003, the USA.
  4. White, T., 2015. Hadoop: The Definitive Guide. 4th ed. Sebastopol, CA: O'Reilly Media, pp. 471-518.
  5. cwiki.apache.org, 2013. LanguageManual LateralView. [ONLINE] Available at: https://cwiki.apache.org/confluence/display/Hive/LanguageManual
  6. spark.apache.org, 2021. Hive Tables. [ONLINE] Available at: https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html

Spark documentation lists the known incompatibilities.

I've also found some incompatibilities due to bugs in the Spark parser. It looks like Hive is more robust.

You may also find differences in Spark's serialization/deserialization implementation, as explained in this answer. Basically, you must adjust these properties:

spark.sql.hive.convertMetastoreOrc=false
spark.sql.hive.convertMetastoreParquet=false

But beware that this will incur a performance penalty.
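
As a rough sketch, the two properties above could be passed when the session is built. This assumes a local master and spark-hive on the classpath; the object and method names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper for illustration; only the two property keys come from the text above.
object HiveSerdeConfigSketch {
  def buildSession(): SparkSession =
    SparkSession.builder()
      .appName("Hive serde config sketch")
      .master("local[*]")   // assumption: run locally
      // Ask Spark to use the Hive serde for metastore ORC/Parquet tables
      // instead of its built-in readers and writers.
      .config("spark.sql.hive.convertMetastoreOrc", "false")
      .config("spark.sql.hive.convertMetastoreParquet", "false")
      .enableHiveSupport()
      .getOrCreate()
}
```

The same keys can also go into spark-defaults.conf or be supplied with `--conf` on spark-submit, which avoids hard-coding them in application code.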
