Hive on Spark and Spark as Hive execution engine: What's the difference

What's the difference between Spark using the Hive metastore and Spark running as Hive's execution engine? I have followed THIS TUTORIAL to configure Spark and Hive, and I have successfully created, populated and analysed data from a Hive table. Now what confuses me is: what have I actually done?

a) Did I configure Spark to use the Hive metastore and analyse data in the Hive table using SparkSQL?
b) Or did I actually use Spark as the Hive execution engine and analyse data in the Hive table using HiveQL, which is what I want to do?

I will try to summarize what I have done to configure Spark and Hive:

a) I followed the tutorial above and configured Spark and Hive
b) Wrote my /conf/hive-site.xml like this, and
c) After that I wrote some code that connects to the Hive metastore and does my analysis. I am using Java for this, and this piece of code starts the Spark session:

SparkSession spark = SparkSession
                .builder()
                .appName("Java Spark SQL basic example")
                .enableHiveSupport()  // use the Hive metastore for table metadata
                .config("spark.sql.warehouse.dir", "hdfs://saurab:9000/user/hive/warehouse")
                .config("mapred.input.dir.recursive", true)
                .config("hive.mapred.supports.subdirectories", true)
                .config("spark.sql.hive.thriftServer.singleSession", true)
                .master("local")
                .getOrCreate();

And this piece of code will create the database and the table. Here db=mydb and table1=mytbl:

String query = "CREATE DATABASE IF NOT EXISTS " + db;
spark.sql(query);

// reuse the same variable for the table DDL (it was declared twice before, which does not compile)
query = "CREATE EXTERNAL TABLE IF NOT EXISTS " + db + "." + table1
        + " (icode String, " +
        "bill_date String, " +
        "total_amount float, " +
        "bill_no String, " +
        "customer_code String) " +
        "COMMENT \"Sales details\" " +
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY \",\" " +
        "LINES TERMINATED BY \"\\n\" " +  // "\\n" so Hive receives the two-character escape \n
        "STORED AS TEXTFILE " +
        "LOCATION 'hdfs://saurab:9000/ekbana2/' " +
        "TBLPROPERTIES(\"skip.header.line.count\"=\"1\")";
spark.sql(query);
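The analysis part then runs HiveQL through the same session. A trimmed-down sketch (the exact queries don't matter for this question; it just reads the table defined above):

// needs: import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row;
// Sketch only: query the external table through the same Hive-enabled session.
Dataset<Row> sales = spark.sql(
        "SELECT customer_code, SUM(total_amount) AS total "
        + "FROM " + db + "." + table1
        + " GROUP BY customer_code");
sales.show();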

Then I create a jar and run it using spark-submit:

./bin/spark-submit --master yarn  --jars jars/datanucleus-api-jdo-3.2.6.jar,jars/datanucleus-core-3.2.10.jar,jars/datanucleus-rdbms-3.2.9.jar,/home/saurab/hadoopec/hive/lib/mysql-connector-java-5.1.38.jar --verbose --properties-file /home/saurab/hadoopec/spark/conf/spark-env.sh --files /home/saurab/hadoopec/spark/conf/hive-site.xml --class HiveRead  /home/saurab/sparkProjects/spark_hive/target/myJar-jar-with-dependencies.jar 

Doing this I get what I want, but I am not very sure I am doing what I really want to do. My question might seem somewhat difficult to understand because I don't know how to explain it. If so, please comment and I will try to expand my question.

Also, if there is any tutorial that focuses on Spark + Hive working together, please provide a link. I also want to know whether Spark reads spark/conf/hive-site.xml or hive/conf/hive-site.xml, because I am confused about where to set hive.execution.engine=spark. Thanks

It seems like you're doing two opposite things at once. The tutorial you linked to gives instructions for using Spark as Hive's execution engine (what you described as option b). This means that you will run your Hive queries almost exactly as before, but behind the scenes Hive will use Spark instead of classic MapReduce. In that case you don't need to write any Java code that uses SparkSession etc. The code you wrote is doing what you described in option a: using Spark to run Hive queries and use the Hive metastore.

So in summary, you don't need to do both. Either use the first tutorial to configure Spark as your Hive execution engine (of course this will still require installing Spark etc.), or write Spark code that executes Hive queries.
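To make option (b) a bit more concrete: hive.execution.engine is a Hive property, so it is read from Hive's own configuration (hive/conf/hive-site.xml) or set per session, and the queries are then plain HiveQL with no Spark code on your side. A rough sketch, reusing the table from your question:

# Option (b), sketched: Hive itself runs the query on Spark.
# Either put hive.execution.engine=spark in hive/conf/hive-site.xml,
# or set it per session as below.
$ hive
hive> SET hive.execution.engine=spark;
hive> SELECT customer_code, SUM(total_amount) FROM mydb.mytbl GROUP BY customer_code;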
