How to enable pyspark HIVE support on Google Dataproc master node

I created a Dataproc cluster and manually installed conda and Jupyter notebook. Then I installed pyspark with conda. I can successfully run Spark with:

from pyspark import SparkContext
sc = SparkContext(appName="EstimatePi")

However, I cannot enable Hive support. The following code hangs and never returns:

from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .config('spark.driver.memory', '2G')
         .config("spark.kryoserializer.buffer.max", "2000m")
         .enableHiveSupport()
         .getOrCreate())

Python version 2.7.13, Spark version 2.3.4

Is there any way to enable Hive support?

I do not recommend manually installing pyspark. When you do this, you get a new Spark/pyspark installation that is separate from Dataproc's own and does not get its configuration, tuning, classpath, etc. This is likely the reason Hive support does not work.
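One way to check which installation a notebook is actually using is to compare the location of the imported pyspark module with SPARK_HOME. The snippet below is only a heuristic sketch: the helper name uses_cluster_spark is made up for illustration, and it assumes SPARK_HOME on the node points at Dataproc's own Spark.

```python
from __future__ import print_function  # keeps print() working on Python 2.7

import os


def uses_cluster_spark(pyspark_file, spark_home):
    """Return True if the given pyspark module path lives under spark_home.

    Hypothetical helper: a pip/conda-installed pyspark normally lives in
    site-packages, outside SPARK_HOME, so False hints at a separate install.
    """
    if not spark_home:
        return False
    module_path = os.path.realpath(pyspark_file)
    home = os.path.realpath(spark_home)
    return module_path.startswith(home.rstrip(os.sep) + os.sep)


if __name__ == "__main__":
    try:
        import pyspark
    except ImportError:
        print("pyspark is not importable from this interpreter")
    else:
        spark_home = os.environ.get("SPARK_HOME")
        print("pyspark module:", pyspark.__file__)
        print("SPARK_HOME:", spark_home)
        print("under SPARK_HOME:", uses_cluster_spark(pyspark.__file__, spark_home))
```

If the module path points into a conda site-packages directory, the notebook is most likely running the separately installed pyspark rather than the one Dataproc configured.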

To get conda with a properly configured pyspark, I suggest selecting the ANACONDA and JUPYTER optional components on image 1.3 (the default) or later.

Additionally, on 1.4 and later images, Miniconda is the default user Python, with pyspark preconfigured. You can pip/conda install Jupyter on your own if you wish.
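Once a session has been created with the preconfigured pyspark, a quick way to confirm that Hive support is active is to read the spark.sql.catalogImplementation setting, which Spark sets to "hive" when Hive support is enabled. This is a minimal sketch; the helper name hive_enabled is hypothetical.

```python
def hive_enabled(spark):
    """Return True if the given SparkSession uses the Hive catalog.

    Spark reports "hive" for spark.sql.catalogImplementation when Hive
    support is enabled and "in-memory" otherwise.
    """
    return spark.conf.get("spark.sql.catalogImplementation", "in-memory") == "hive"


if __name__ == "__main__":
    try:
        from pyspark.sql import SparkSession
    except ImportError:
        print("pyspark is not importable from this interpreter")
    else:
        spark = SparkSession.builder.enableHiveSupport().getOrCreate()
        print(hive_enabled(spark))
        # Listing databases goes through the metastore, so it doubles as a
        # connectivity check.
        spark.sql("SHOW DATABASES").show()
```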

See https://cloud.google.com/dataproc/docs/tutorials/python-configuration

Also, as @Jayadeep Jayaraman points out, the Jupyter optional component works with Component Gateway, which means you can use it from a link in the Developers Console instead of opening ports to the world or setting up an SSH tunnel.

tl;dr: I recommend these flags for your next cluster: --optional-components ANACONDA,JUPYTER --enable-component-gateway
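For reference, here is a sketch of how those flags fit into a full gcloud dataproc clusters create command, assembled in Python so it can be printed or passed to subprocess. The cluster name, region, and image version below are placeholders I made up, and depending on your gcloud release the component gateway flag may require the beta track (gcloud beta dataproc ...).

```python
def build_create_command(cluster_name, region, image_version="1.4"):
    """Build the gcloud argument list for a cluster with Jupyter preconfigured.

    cluster_name, region and the default image_version are illustrative
    assumptions; only the two component flags come from the answer above.
    """
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        "--region", region,
        "--image-version", image_version,
        "--optional-components", "ANACONDA,JUPYTER",
        "--enable-component-gateway",
    ]


if __name__ == "__main__":
    # Print the command for review instead of running it.
    print(" ".join(build_create_command("my-cluster", "us-central1")))
```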

Cloud Dataproc now lets you install optional components in the cluster and provides an easy way to access them through the Component Gateway. You can find details on installing Jupyter and Conda here: https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook

Details of the Component Gateway can be found here: https://cloud.google.com/dataproc/docs/concepts/accessing/dataproc-gateways . Note that this is Alpha.
