
Spark 2 connection to Hive MetaStore

For the last 3 weeks, I have been trying to connect to the Hive metastore remotely from my machine.

I have all the configuration files:

  • hive-site.xml
  • and the configuration for HDFS

I have already managed to use files from HDFS, so that part works.

I think I have all the jars for the Spark -> Hive connection.
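For reference, the Spark -> Hive integration classes live in a separate spark-hive module, so that jar has to be on the driver classpath in addition to spark-core and spark-sql. A minimal sbt sketch (the version number is an assumption; match it to the Spark build you actually run):

// build.sbt -- hypothetical version, align it with your cluster
val sparkVersion = "2.1.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql"  % sparkVersion,
  "org.apache.spark" %% "spark-hive" % sparkVersion  // provides the Hive classes enableHiveSupport() looks for
)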

The code I wrote is the following:

import org.apache.spark.sql.Row

import org.apache.spark.sql.SparkSession

val warehouseLocation = "/user/hive/warehouse"


val spark = SparkSession
  .builder()
  .appName("SparkHiveExample")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()
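If the session does come up, a quick way to check that it is really talking to the remote metastore (rather than a local, empty one) is to list what the metastore knows about; a minimal sketch using the spark value built above:

// Should list the databases and tables registered in the remote metastore
spark.sql("SHOW DATABASES").show()
spark.sql("SHOW TABLES").show()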

It throws this exception:

Unable to instantiate SparkSession with Hive support because Hive classes are not found. at org.apache.spark.sql.SparkSession$Builder.enableHiveSupport

What jar am I missing?

Observations

If I don't use enableHiveSupport(), then it works.

But then I get the following exception:

could not initialize class org.apache.spark.rdd.RDDOperationScope

I am not sure, but this may be happening because you forgot to export HIVE_HOME during your Hive installation, so SparkSession cannot find where to look for the Hive classes. To fix this, edit your bash_profile:

nano ~/.bash_profile

Add the following lines to your bash_profile:

export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin

Save this file and then:

source ~/.bash_profile

After this, try to run your code again. This may solve your problem.

I had done this in the past and unfortunately it was not very straightforward. I created a custom distribution of Spark with Hive support using the command:

./make-distribution.sh --name my-spark-dist --tgz  -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn 

I used the client configuration for Hive (hive-site.xml, core-site.xml and hdfs-site.xml) pointing to the remote Hive and HDFS, and had to change the firewall configuration to allow connections to the Thrift server port.
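For reference, instead of (or in addition to) shipping hive-site.xml, the metastore location can be set directly on the session builder; a minimal sketch, assuming a hypothetical metastore host and the default Thrift port 9083:

import org.apache.spark.sql.SparkSession

// Sketch only: "metastore-host" is a placeholder for your remote metastore machine
val spark = SparkSession
  .builder()
  .appName("RemoteMetastoreExample")
  .config("hive.metastore.uris", "thrift://metastore-host:9083")
  .enableHiveSupport()
  .getOrCreate()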

Spark is compiled with Hive 1.2.1, and the documentation says you can use a metastore of a lower version, but that doesn't work. The lowest version that works is 1.2.0, because at runtime it picks up the jars specified in your config property, but at build time it uses Hive version 1.2.1. I raised a Spark bug for this: https://issues.apache.org/jira/browse/SPARK-14492. I had to upgrade my metastore DB and service to version 1.2.0 using the upgrade tool provided with Hive.
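The config property referred to here is presumably spark.sql.hive.metastore.version, together with spark.sql.hive.metastore.jars. A hedged sketch of how they are typically set (the version matches the upgraded metastore described above; the jar path is a placeholder):

// Sketch: point Spark at the Hive client version/jars matching the metastore
val spark = SparkSession
  .builder()
  .config("spark.sql.hive.metastore.version", "1.2.0")
  .config("spark.sql.hive.metastore.jars", "/path/to/hive/lib/*")  // hypothetical path to the matching Hive jars
  .enableHiveSupport()
  .getOrCreate()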
