
How to query Hive from Spark on CDH 5.4.4

I'm running CDH 5.4.4 (which bundles Spark 1.3.0) and would like to read a Hive table into a Spark DataFrame.

The documentation suggests that we can do the following:

from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
results = sqlContext.sql("SHOW TABLES").collect()

... provided that Spark has been built with the -Phive and -Phive-thriftserver flags set.
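
For the original goal of reading a Hive table into a DataFrame, a minimal sketch along the same lines would be the following (assuming Hive support is present; my_table is a placeholder for a real table name):

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)  # sc is the SparkContext created by the pyspark shell

# read an existing Hive table into a DataFrame; 'my_table' is a placeholder
df = sqlContext.table("my_table")
df.printSchema()
df.show()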

I'm not sure whether Cloudera's build has those flags set.

When I run the snippet, it returns the following error:

15/07/10 16:54:10 WARN HiveMetaStore: Retrying creating default database after error: Error creating transactional connection factory
  javax.jdo.JDOFatalInternalException: Error creating transactional connection factory

I have two questions:

  1. Does Cloudera's Spark build have the -Phive and -Phive-thriftserver flags set?
  2. What do I need to do to query Hive from Spark?

Update

This almost works:

I created a symlink in $SPARK_HOME/conf/ pointing to hive-site.xml, i.e.:

ln -s /etc/hive/conf.cloudera.hive/hive-site.xml $SPARK_HOME/conf/hive-site.xml

I then restarted the Spark service and was able to access Hive. Unfortunately, the symlink didn't survive a reboot.

Copy hive-site.xml to Spark's conf.dist directory, as shown below:

sudo cp /etc/impala/conf.dist/hive-site.xml /etc/spark/conf.dist/

Now you should be able to read Hive data from PySpark.
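
As a quick smoke test (a sketch; the try/except only surfaces a lingering misconfiguration):

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)  # sc is the SparkContext created by the pyspark shell

try:
    # if hive-site.xml is being picked up, this lists the metastore's tables
    # instead of failing with the connection error shown above
    print(sqlContext.sql("SHOW TABLES").collect())
except Exception as e:
    print("Hive is still not reachable: %s" % e)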
