
Spark-SQL plugin on HIVE

HIVE has a metastore, and HiveServer2 listens for SQL requests; with the help of the metastore, the query is executed and the result is passed back. HiveServer2 is essentially a customised Thrift service. In this way, HIVE acts as a service, and via a programming language we can use HIVE as a database.

The relationship between Spark-SQL and HIVE is as follows:

Spark-SQL just utilises the HIVE setup (HDFS file system, HIVE metastore, HiveServer2). When we invoke sbin/start-thriftserver.sh (present in the Spark installation), we are supposed to give the Thrift server a hostname and port number. Then, via Spark's beeline, we can actually create, drop and manipulate tables in HIVE. The API can be either Spark-SQL or HiveQL. If we create or drop a table, it will be clearly visible if we log into HIVE and check (say via HIVE beeline or the HIVE CLI). In other words, changes made via Spark can be seen in HIVE tables.
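A minimal sketch of this first approach, assuming Spark is installed under $SPARK_HOME; the hostname and port below are placeholders, not values from the question:

```sh
# Start Spark's Thrift JDBC/ODBC server (Spark's drop-in for HiveServer2).
# Hostname and port are assumptions -- adjust them to your environment.
$SPARK_HOME/sbin/start-thriftserver.sh \
  --hiveconf hive.server2.thrift.bind.host=localhost \
  --hiveconf hive.server2.thrift.port=10015

# Connect with Spark's beeline; tables created here go through the Hive metastore.
$SPARK_HOME/bin/beeline -u jdbc:hive2://localhost:10015
```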

My understanding is that Spark does not have its own metastore setup like HIVE does. Spark just utilises the HIVE setup, and the SQL execution simply happens via the Spark SQL API.

Is my understanding correct here?

Then I am a little confused about the usage of bin/spark-sql (which is also present in the Spark installation). The documentation says that via this SQL shell we can create tables as we did above (via the Thrift server / beeline). Now my question is: how is the metadata information maintained by Spark then?

Or, as in the first approach, can we make the spark-sql CLI communicate with HIVE (to be specific: with HIVE's hiveserver2)? If yes, how can we do that?

Thanks in advance!

My understanding is that Spark does not have its own metastore setup like HIVE

Spark will start an embedded Derby metastore on its own if a Hive metastore is not provided.
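For instance (a sketch, assuming no hive-site.xml is present in $SPARK_CONF_DIR), running the CLI once creates the embedded metastore files in the directory you launch from:

```sh
# With no Hive configuration, Spark falls back to an embedded Derby-backed
# metastore, created in the current working directory on first use.
$SPARK_HOME/bin/spark-sql -e "SHOW DATABASES;"

ls
# metastore_db/   derby.log
```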

can we make the spark-sql CLI communicate with HIVE

Start an external metastore process and add a hive-site.xml file to $SPARK_CONF_DIR with hive.metastore.uris set, or use SET SQL statements to the same effect.
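A minimal sketch of that configuration file; the metastore URI is an assumption, so point it at your own metastore service:

```sh
# Write a minimal hive-site.xml into $SPARK_CONF_DIR (the URI is a placeholder).
cat > "$SPARK_CONF_DIR/hive-site.xml" <<'EOF'
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host:9083</value>
  </property>
</configuration>
EOF
```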

Then the spark-sql CLI should be able to query Hive tables. From code, you need to use the enableHiveSupport() method on the SparkSession builder.
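For example, a minimal Scala sketch (the app name and metastore URI are assumptions):

```scala
import org.apache.spark.sql.SparkSession

// enableHiveSupport() makes Spark use the Hive metastore as its catalog.
// The URI below is a placeholder for your own metastore service.
val spark = SparkSession.builder()
  .appName("hive-example")
  .config("hive.metastore.uris", "thrift://metastore-host:9083")
  .enableHiveSupport()
  .getOrCreate()

// Tables created here are visible from HIVE beeline / the HIVE CLI as well.
spark.sql("CREATE TABLE IF NOT EXISTS demo_tbl (id INT, name STRING)")
spark.sql("SHOW TABLES").show()
```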
