Connecting to Hive from Spark with the JDBC driver

Question

I need to move data from a remote Hive to a local Hive with Spark. I am trying to connect to the remote Hive with the JDBC driver 'org.apache.hive.jdbc.HiveDriver'. When I read from Hive, the result contains the column headers in the column values instead of the actual data:

# Read the remote table through Spark's generic JDBC source.
# The URL must be an f-string so that host, port and database are substituted.
df = self.spark_session.read.format('jdbc') \
         .option('url', f"jdbc:hive2://{self.host}:{self.port}/{self.database}") \
         .option('driver', 'org.apache.hive.jdbc.HiveDriver') \
         .option('user', self.username) \
         .option('password', self.password) \
         .option('dbtable', 'test_table') \
         .load()
df.show()

Result:

+----------+
|str_column|
+----------+
|str_column|
|str_column|
|str_column|
|str_column|
|str_column|
+----------+

I know that Hive JDBC isn't officially supported in Apache Spark. But I have already found solutions for loading data from other unsupported sources, such as IBM Informix. Maybe someone has already solved this problem.

Answer 1

After debugging and tracing the code, we find that the problem is in JdbcDialect. There is no HiveDialect, so Spark uses the default JdbcDialect.quoteIdentifier, which wraps column names in double quotes. Hive reads a double-quoted name as a string literal, so the column name itself comes back as the value of every row. You should implement a HiveDialect to fix this problem:

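import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Dialect for HiveServer2 JDBC URLs: quotes identifiers with backticks.
class HiveDialect extends JdbcDialect {
  // Apply this dialect to every URL that starts with jdbc:hive2.
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:hive2")

  // Drop a leading "table." prefix, then wrap the column name in backticks.
  // indexOf returns -1 when there is no dot, so substring(0) keeps the whole name.
  override def quoteIdentifier(colName: String): String = {
    val unqualified = colName.substring(colName.indexOf(".") + 1)
    s"`$unqualified`"
  }
}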
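
The effect of the two quoting rules can be illustrated with a small Python sketch. It is only an approximation: `default_quote` mimics Spark's default double-quote quoting, `hive_quote` mirrors the HiveDialect above, and `build_select` is a hypothetical helper standing in for the query Spark generates, not Spark's actual code. Assuming Hive's default behavior of reading double-quoted text as a string literal, the first query returns the quoted text for every row, while the second selects the real column.

```python
def default_quote(col_name: str) -> str:
    """Mimic Spark's default quoting: wrap the name in double quotes."""
    return f'"{col_name}"'


def hive_quote(col_name: str) -> str:
    """Mirror the HiveDialect above: drop a 'table.' prefix, use backticks."""
    if "." in col_name:
        col_name = col_name[col_name.index(".") + 1:]
    return f"`{col_name}`"


def build_select(columns, table, quote):
    """Hypothetical, simplified stand-in for the SELECT a JDBC read issues."""
    return f"SELECT {', '.join(quote(c) for c in columns)} FROM {table}"


if __name__ == "__main__":
    cols = ["test_table.str_column"]
    # Double quotes: Hive sees a string literal, not a column reference.
    print(build_select(cols, "test_table", default_quote))
    # Backticks: Hive sees a column reference.
    print(build_select(cols, "test_table", hive_quote))
```

The first line printed is `SELECT "test_table.str_column" FROM test_table` and the second is ``SELECT `str_column` FROM test_table``.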

Then register the dialect:

JdbcDialects.registerDialect(new HiveDialect)

Finally, add the option hive.resultset.use.unique.column.names=false to the URL, like this:

option("url", "jdbc:hive2://bigdata01:10000?hive.resultset.use.unique.column.names=false")
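
On the PySpark side used in the question, the same URL can be assembled with a small helper. This is a minimal sketch: `hive_jdbc_url` is a hypothetical name, and it assumes the Hive JDBC convention that configuration settings follow `?` and are separated by `;`.

```python
def hive_jdbc_url(host, port, database=None, hive_conf=None):
    """Build a jdbc:hive2 URL with optional Hive configuration settings."""
    url = f"jdbc:hive2://{host}:{port}"
    if database:
        url += f"/{database}"
    if hive_conf:
        # Assumption: Hive conf entries go after '?' and are joined with ';'.
        url += "?" + ";".join(f"{key}={value}" for key, value in hive_conf.items())
    return url


if __name__ == "__main__":
    print(hive_jdbc_url(
        "bigdata01",
        10000,
        hive_conf={"hive.resultset.use.unique.column.names": "false"},
    ))
```

The returned string can be passed to `.option('url', ...)` in place of the hand-written URL.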

Reference: a CSDN blog post.

Answer 2

Apache Kyuubi provides a Hive dialect plugin: https://kyuubi.readthedocs.io/en/latest/extensions/engines/spark/jdbc-dialect.html

The Hive Dialect plugin aims to provide Hive dialect support for Spark's JDBC source. It is registered with Spark automatically and applied to JDBC sources whose URL starts with jdbc:hive2:// or jdbc:kyuubi://. It quotes identifiers in Hive SQL style, e.g. it quotes table.column as `table`.`column`.

1. Compile the dialect plugin from the Kyuubi sources. (It is a standalone Spark plugin and does not depend on Kyuubi.)
2. Put the jar into $SPARK_HOME/jars.
3. Add the plugin to the Spark configuration with spark.sql.extensions=org.apache.spark.sql.dialect.KyuubiSparkJdbcDialectExtension; it is then registered with Spark automatically.
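
The configuration step above could look like this in PySpark. This is a sketch under assumptions: it presumes the plugin jar is already in $SPARK_HOME/jars, and `create_session` and the application name are hypothetical. Only the `spark.sql.extensions` value comes from the answer.

```python
# Setting taken from the steps above; the jar must already be on Spark's classpath.
KYUUBI_DIALECT_CONF = {
    "spark.sql.extensions":
        "org.apache.spark.sql.dialect.KyuubiSparkJdbcDialectExtension",
}


def create_session(app_name="hive-jdbc-read"):
    """Create a SparkSession with the Kyuubi Hive dialect extension enabled."""
    # Imported lazily so this module loads even where pyspark is not installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in KYUUBI_DIALECT_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With such a session, the JDBC read from the question can be issued unchanged, since the plugin applies to URLs that start with jdbc:hive2://.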
