
How to query Hive from Spark on CDH 5.4.4

I'm running CDH 5.4.4 (which bundles Spark 1.3.0) and would like to read a Hive table into a Spark DataFrame.

The documentation suggests that we can do the following:

from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
results = sqlContext.sql("SHOW TABLES").collect()

... provided that Spark has been built with the -Phive and -Phive-thriftserver flags set.
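
For the original goal of reading a Hive table into a DataFrame, a minimal sketch along the same lines would be the following (assuming Hive support is present; my_table is a placeholder for a real table name):

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)  # sc is the SparkContext created by the pyspark shell

# read an existing Hive table into a DataFrame; 'my_table' is a placeholder
df = sqlContext.table("my_table")
df.printSchema()
df.show()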

I'm not sure whether Cloudera's build has those flags set.

When I run the snippet, it returns the following error:

15/07/10 16:54:10 WARN HiveMetaStore: Retrying creating default database after error: Error creating transactional connection factory
  javax.jdo.JDOFatalInternalException: Error creating transactional connection factory

I have two questions:

  1. Does Cloudera's Spark build have the -Phive and -Phive-thriftserver flags set?
  2. What do I need to do to query Hive from Spark?

Update

This almost works:

I created a symlink in $SPARK_HOME/conf/ pointing to hive-site.xml, i.e.:

ln -s /etc/hive/conf.cloudera.hive/hive-site.xml $SPARK_HOME/conf/hive-site.xml

I then restarted the Spark service and was able to access Hive. Unfortunately, the symlink didn't survive a reboot.

Copy hive-site.xml to Spark's conf.dist directory, as shown below:

sudo cp /etc/impala/conf.dist/hive-site.xml /etc/spark/conf.dist/

Now you should be able to read Hive data from PySpark.
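
As a quick smoke test (a sketch; the try/except only surfaces a lingering misconfiguration):

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)  # sc is the SparkContext created by the pyspark shell

try:
    # if hive-site.xml is being picked up, this lists the metastore's tables
    # instead of failing with the connection error shown above
    print(sqlContext.sql("SHOW TABLES").collect())
except Exception as e:
    print("Hive is still not reachable: %s" % e)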
