In the official Spark documentation, two types of SQL syntax are mentioned: Spark native SQL syntax and HiveQL syntax. I couldn't find a detailed explanation of their differences, and I'm confused by the following questions:
HiveQL is a mixture of SQL-92, MySQL, and Oracle's SQL dialect. It also provides features from later SQL standards, such as window functions. Additionally, HiveQL extends some features that don't belong to any SQL standard; they are inspired by MapReduce, e.g., multi-table inserts.
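A multi-table insert is a good illustration of such a MapReduce-inspired extension: a single scan of a source table feeds several target tables. A minimal sketch, assuming hypothetical tables `source_logs`, `errors`, and `warnings`:

```sql
-- Hypothetical tables, for illustration only.
-- One pass over source_logs populates two targets.
FROM source_logs
INSERT OVERWRITE TABLE errors   SELECT * WHERE level = 'ERROR'
INSERT OVERWRITE TABLE warnings SELECT * WHERE level = 'WARN';
```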
Briefly speaking, since Apache Hive is a kind of data warehouse on top of Hadoop, you can analyze data with the Java-based power of MapReduce via the SQL-like HiveQL.
With Spark SQL, you can read and write data in a variety of structured formats, one of which is Hive tables. Spark SQL supports ANSI SQL:2003-compliant commands as well as HiveQL. In a nutshell, you can manipulate data with the power of the Spark engine via the SQL-like Spark SQL, and Spark SQL covers the majority of HiveQL's features.
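As a rough sketch of that (assuming a `SparkSession` named `spark` and a hypothetical input file `people.json`), reading one structured format and persisting it as a Hive table might look like this:

```scala
// Hypothetical file and table names, for illustration only.
val df = spark.read.json("people.json")      // JSON is one of several supported formats
df.write.saveAsTable("people")               // persisted via the Hive metastore
spark.sql("SELECT name FROM people").show()  // query it back with SQL
```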
When working with Hive, you must instantiate `SparkSession` with Hive support, including connectivity to a persistent Hive metastore, support for Hive SerDes, and Hive user-defined functions.
Users who do not have an existing Hive deployment can still enable Hive support. Spark deals with the storage for you.
val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()
For example, `rlike` is not part of the ANSI SQL standard, but it is covered by HiveQL (and therefore by Spark SQL).

/* Example of utilizing HiveQL via Spark SQL */
CREATE TABLE person (id INT, name STRING, age INT, class INT, address STRING);
INSERT INTO person VALUES
(100, 'John', 30, 1, 'Street 1'),
(200, 'Mary', NULL, 1, 'Street 2'),
(300, 'Mike', 80, 3, 'Street 3'),
(400, 'Dan', 50, 4, 'Street 4');
/* Utilize a feature from HiveQL */
SELECT * FROM person
LATERAL VIEW EXPLODE(ARRAY(30, 60)) tableName AS c_age
LATERAL VIEW EXPLODE(ARRAY(40, 80)) AS d_age;
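The two `LATERAL VIEW EXPLODE` clauses behave like a cross product: every input row is expanded into 2 × 2 = 4 output rows, one per combination of `c_age` and `d_age`. A plain-Scala sketch of that expansion for a single row:

```scala
// Cross product produced by the two EXPLODE calls, per input row
val cAges = Seq(30, 60)
val dAges = Seq(40, 80)
val combos = for (c <- cAges; d <- dAges) yield (c, d)
// combos: Seq((30,40), (30,80), (60,40), (60,80)) -- 4 rows per person row
```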
The Spark documentation lists the known incompatibilities with Hive.
I've also found some incompatibilities due to bugs in the Spark parser. It looks like Hive's parser is more robust.
You may also find differences in Spark's serialization/deserialization implementation, as explained in this answer. Basically, you must adjust these properties:
spark.sql.hive.convertMetastoreOrc=false
spark.sql.hive.convertMetastoreParquet=false
but beware that this comes with a performance penalty.
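If you go that route, these can be set when building the session; a sketch based on the builder shown earlier:

```scala
// Keep Hive's own SerDes instead of Spark's native ORC/Parquet readers;
// slower, but avoids the serde differences mentioned above.
val spark = SparkSession
  .builder()
  .config("spark.sql.hive.convertMetastoreOrc", "false")
  .config("spark.sql.hive.convertMetastoreParquet", "false")
  .enableHiveSupport()
  .getOrCreate()
```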