
How to enable pyspark HIVE support on Google Dataproc master node

I created a Dataproc cluster and manually installed conda and Jupyter notebook. Then I installed pyspark with conda. I can successfully run Spark with

from pyspark import SparkContext
sc = SparkContext(appName="EstimatePi")

However, I cannot enable HIVE support. The following code hangs and never returns.

from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .config('spark.driver.memory', '2G')
         .config("spark.kryoserializer.buffer.max", "2000m")
         .enableHiveSupport()
         .getOrCreate())
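
On a working cluster I would expect the builder to return promptly and a simple check like this (a sketch) to succeed:

# Expected sanity check (sketch): should list at least the "default" Hive database.
spark.sql("SHOW DATABASES").show()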

Python version 2.7.13, Spark version 2.3.4

Any way to enable HIVE support?

I do not recommend manually installing pyspark. When you do this, you get a new Spark/pyspark installation that is different from Dataproc's own, and you miss its configuration/tuning/classpath/etc. This is likely the reason Hive support does not work.
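
If you do want to keep the manually installed conda/Jupyter, one workaround (a sketch, not something covered above) is to point the notebook at Dataproc's bundled Spark rather than the conda one, using the findspark package. This assumes Dataproc's Spark lives at /usr/lib/spark, its usual location on Dataproc images:

# Sketch: reuse Dataproc's own Spark from a conda-installed notebook.
# Assumes the standard Dataproc install path /usr/lib/spark.
import findspark
findspark.init("/usr/lib/spark")  # must run before importing pyspark

from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

That way the session picks up Dataproc's hive-site.xml and classpath instead of the bare conda install.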

To get conda with a properly configured pyspark, I suggest selecting the ANACONDA and JUPYTER optional components on image 1.3 (the default) or later.

Additionally, on 1.4 and later images, Miniconda is the default user Python with pyspark preconfigured. You can pip/conda install Jupyter on your own if you wish.
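
For example (a sketch; either command works in the default Miniconda environment):

# Install Jupyter into the default user Python (sketch).
conda install -y jupyter
# or equivalently:
pip install jupyter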

See https://cloud.google.com/dataproc/docs/tutorials/python-configuration

Also, as @Jayadeep Jayaraman points out, the Jupyter optional component works with Component Gateway, which means you can use it from a link in the Developers Console instead of opening ports to the world or SSH tunneling.

tl;dr: I recommend these flags for your next cluster: --optional-components ANACONDA,JUPYTER --enable-component-gateway
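
A full creation command might look like this (a sketch; the cluster name, region, and image version are placeholders to adapt):

# Sketch: create a cluster with conda + Jupyter preinstalled and Component Gateway enabled.
gcloud dataproc clusters create my-cluster \
    --region us-central1 \
    --image-version 1.4 \
    --optional-components ANACONDA,JUPYTER \
    --enable-component-gateway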

Cloud Dataproc now has the option to install optional components on the cluster and also provides an easy way of accessing them via Component Gateway. You can find details of installing Jupyter and Conda here: https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook

The details of Component Gateway can be found here: https://cloud.google.com/dataproc/docs/concepts/accessing/dataproc-gateways. Note that this feature is in Alpha.
