
What's the difference between Spark SQL native syntax and HiveQL syntax in Spark?

In the Spark official documentation, two types of SQL syntax are mentioned: Spark native SQL syntax and HiveQL syntax. I couldn't find a detailed explanation of their differences, and I'm confused about the following questions:

  1. Is Spark native SQL syntax a subset of HiveQL? I ask because some articles say so, yet according to the explanation on the Spark official page https://spark.apache.org/docs/3.0.0-preview2/sql-migration-guide.html#compatibility-with-apache-hive , it seems that Spark SQL does not support all features of HiveQL.
  2. If the answer to question 1 is yes, why can I run "join A rlike B" in Spark SQL but not in Hive?
  3. How does Spark decide whether to treat a SQL statement as Spark native SQL or HiveQL?
  4. When we use enableHiveSupport during initialization of the SparkSession, does it mean Spark will treat all given SQL statements as HiveQL?

Prologue

HiveQL is a mixture of SQL-92, MySQL, and Oracle's SQL dialect. It also provides features from later SQL standards, such as window functions. In addition, HiveQL adds some features that do not belong to any SQL standard; they are inspired by MapReduce, e.g., multi-table inserts.
Briefly speaking, you can analyze data with the Java-based power of MapReduce via the SQL-like HiveQL, since Apache Hive is a kind of data warehouse built on top of Hadoop.

With Spark SQL, you can read and write data in a variety of structured formats, one of which is Hive tables. Spark SQL supports ANSI SQL:2003-compliant commands as well as HiveQL. In a nutshell, you can manipulate data with the power of the Spark engine via the SQL-like Spark SQL, and Spark SQL covers the majority of HiveQL's features.

When working with Hive, you must instantiate a SparkSession with Hive support, including connectivity to a persistent Hive metastore, support for Hive SerDes, and Hive user-defined functions.
Users who do not have an existing Hive deployment can still enable Hive support; Spark handles the storage for you.

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

Answers

  1. I would say that they overlap heavily; Spark SQL is almost a superset of HiveQL.
  2. Spark SQL is not a subset of HiveQL. As for the latter part, it is because regular-expression-like predicates were introduced in the SQL:2003 standard. Spark SQL is SQL:2003 compliant, whereas HiveQL implements only very few of the features introduced in SQL:2003, and rlike is not among them.
  3. You would have to read the source code of Spark. Practically speaking, in my opinion, you only need to keep in mind that Spark SQL helps you read and write data from a variety of data sources and that it covers HiveQL; Spark SQL is conferred with most of HiveQL's capabilities.
  4. Not exactly. Spark SQL is Spark SQL. Enabling the functionality you mentioned usually means you are going to communicate with Apache Hive. Even if you don't have an actual Apache Hive deployment, with that functionality enabled you can utilize some HiveQL features via Spark SQL, since Spark SQL supports the majority of HiveQL's features and Spark has an internal mechanism to deal with data-warehouse storage.
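As a concrete illustration of point 2, a regular-expression predicate such as rlike runs directly in Spark SQL. This is a minimal sketch; the employee table and the pattern are made up for illustration:

```sql
-- Hypothetical table; in Spark SQL, RLIKE matches a Java regular expression.
CREATE TABLE employee (id INT, name STRING);
INSERT INTO employee VALUES (1, 'Mary'), (2, 'John');

-- Keep only the names starting with 'M'.
SELECT name FROM employee WHERE name RLIKE '^M';
```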
/* Example of Utilizing HiveQL via Spark SQL */
CREATE TABLE person (id INT, name STRING, age INT, class INT, address STRING);
INSERT INTO person VALUES
    (100, 'John', 30, 1, 'Street 1'),
    (200, 'Mary', NULL, 1, 'Street 2'),
    (300, 'Mike', 80, 3, 'Street 3'),
    (400, 'Dan', 50, 4, 'Street 4');

/* Utilize a feature from HiveQL */
SELECT * FROM person
    LATERAL VIEW EXPLODE(ARRAY(30, 60)) tableName AS c_age
    LATERAL VIEW EXPLODE(ARRAY(40, 80)) AS d_age;

References

  1. Damji, J., Wenig, B., Das, T. and Lee, D., 2020. Learning Spark: Lightning-Fast Data Analytics. 2nd ed. Sebastopol, CA: O'Reilly, pp. 83-112.
  2. ISO/IEC JTC 1/SC 32 Data management and interchange, 1992. Information technology - Database languages - SQL. ISO/IEC 9075:1992.
  3. ISO/IEC JTC 1/SC 32 Data management and interchange, 2003. Information technology - Database languages - SQL - Part 2: Foundation (SQL/Foundation). ISO/IEC 9075-2:2003.
  4. White, T., 2015. Hadoop: The Definitive Guide. 4th ed. Sebastopol, CA: O'Reilly Media, pp. 471-518.
  5. cwiki.apache.org, 2013. LanguageManual LateralView. [ONLINE] Available at: https://cwiki.apache.org/confluence/display/Hive/LanguageManual .
  6. spark.apache.org, 2021. Hive Tables. [ONLINE] Available at: https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html .

The Spark documentation lists the known incompatibilities.

I've also found some incompatibilities due to bugs in Spark's parser. It looks like Hive is more robust.

You may also find differences in Spark's serialization/deserialization implementation, as explained in this answer. Basically, you must adjust these properties:

spark.sql.hive.convertMetastoreOrc=false
spark.sql.hive.convertMetastoreParquet=false

but beware that this will incur a performance penalty.
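For example, following the SparkSession snippet from the first answer, these properties can be set when building the session. This is a sketch; the app name is illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: force Spark to use Hive SerDes instead of its built-in ORC/Parquet
// readers for Hive metastore tables, so that serialization behavior matches
// Hive's (at the cost of performance).
val spark = SparkSession
  .builder()
  .appName("Hive SerDe Compatibility") // illustrative name
  .config("spark.sql.hive.convertMetastoreOrc", "false")
  .config("spark.sql.hive.convertMetastoreParquet", "false")
  .enableHiveSupport()
  .getOrCreate()
```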
