I'm running CDH 5.4.4 (which bundles Spark 1.3.0) and would like to read a Hive table into a Spark DataFrame.
The documentation suggests that we can do the following:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
results = sqlContext.sql("SHOW TABLES").collect()
... provided that Spark has been built with the -Phive and -Phive-thriftserver flags set.
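For reference, going from SHOW TABLES to an actual DataFrame is one more call; a minimal sketch, assuming a Hive table named my_table exists (the name is a placeholder):

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)  # sc is the SparkContext the pyspark shell provides
df = sqlContext.table("my_table")  # "my_table" is a placeholder Hive table name
# equivalently: df = sqlContext.sql("SELECT * FROM my_table")
df.show()  # print the first rows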
I'm not sure whether Cloudera's build has those flags set.
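One rough way to check from the Python side (relying on the Spark 1.x convention that builds compiled with -Phive ship DataNucleus jars under $SPARK_HOME/lib; treat the result as a hint, not proof):

import glob, os
# Spark 1.x builds compiled with -Phive place datanucleus*.jar under
# $SPARK_HOME/lib, so their presence suggests Hive support was compiled in.
spark_home = os.environ["SPARK_HOME"]
jars = glob.glob(os.path.join(spark_home, "lib", "datanucleus*.jar"))
print("Hive support likely built in" if jars else "no DataNucleus jars found")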
When I run the SHOW TABLES snippet, it returns the following error:
15/07/10 16:54:10 WARN HiveMetaStore: Retrying creating default database after error: Error creating transactional connection factory
javax.jdo.JDOFatalInternalException: Error creating transactional connection factory
I have two questions:

1. How can I tell whether Cloudera's Spark build has the hive and hive-thriftserver flags set?
2. How do I fix this error so that I can read Hive tables from pyspark?

Update
This almost works:
I symlinked hive-site.xml into $SPARK_HOME/conf/, i.e.
ln -s /etc/hive/conf.cloudera.hive/hive-site.xml $SPARK_HOME/conf/hive-site.xml
I then restarted the Spark service and was able to access Hive. Unfortunately, the symlink didn't survive a reboot.
Copy hive-site.xml into the Spark conf.dist directory, as shown below:
sudo cp /etc/impala/conf.dist/hive-site.xml /etc/spark/conf.dist/
You should now be able to read Hive data from pyspark.
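To confirm the copy took effect, a quick check from the pyspark shell:

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)
# With hive-site.xml picked up, this should list the real Hive tables
# instead of failing with the metastore connection error above.
print(sqlContext.sql("SHOW TABLES").collect())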