
Submit an application to a standalone spark cluster running in GCP from Python notebook

I am trying to submit a Spark application to a standalone Spark (2.1.1) cluster of 3 VMs running in GCP from my Python 3 notebook (running on my local laptop), but for some reason the Spark session throws the error "StandaloneAppClient$ClientEndpoint: Failed to connect to master sparkmaster:7077".

Environment details: IPython and the Spark master are running in one GCP VM called "sparkmaster". 3 additional GCP VMs are running Spark slaves and Cassandra clusters. I connect from my local laptop (MBP) using Chrome to the IPython notebook on the "sparkmaster" GCP VM.

Please note that it works from the terminal:

bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.1 --master spark://sparkmaster:7077 ex.py 1000

Running it from Python Notebook:

import os
os.environ["PYSPARK_SUBMIT_ARGS"] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.1 pyspark-shell'

from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder.master("spark://sparkmaster:7077").appName('somatic').getOrCreate()  # this step works if I use .master('local') instead

df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "kafka1:9092,kafka2:9092,kafka3:9092") \
  .option("subscribe", "gene") \
  .load()

So far I have tried the following:

  1. I have tried to change spark-defaults.conf and spark-env.sh on the Spark master node to add SPARK_MASTER_IP.

  2. Tried to find the STANDALONE_SPARK_MASTER_HOST=`hostname -f` setting so that I can remove the "-f". For some reason my Spark master UI shows FQDN:7077, not hostname:7077.

  3. Passed the FQDN as a param to .master() and to os.environ["PYSPARK_SUBMIT_ARGS"] (a sketch of this attempt follows this list).
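
For reference, attempt 3 looked roughly like this; <fqdn-of-master> is only a placeholder for whatever FQDN the master web UI reports, not my real hostname:

import os

# sketch of attempt 3: pin the master URL (with the FQDN) in PYSPARK_SUBMIT_ARGS as well
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    '--master spark://<fqdn-of-master>:7077 '
    '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.1 pyspark-shell'
)

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("spark://<fqdn-of-master>:7077").appName('somatic').getOrCreate()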

Please let me know if you need more details.

After doing some more research I was able to resolve the conflict. It was due to a simple environment variable called SPARK_HOME. In my case it was pointing to Conda's /bin (pyspark was running from that location) whereas my Spark setup was present in a different path. The simple fix was to add export SPARK_HOME="/home/<>/spark/" to the .bashrc file (I want this to be attached to my profile, not to the Spark session).
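
A quick way to spot this kind of mismatch from the notebook itself (a generic check, nothing here is specific to my setup) is to print which install the driver is actually using:

import os
import pyspark

# if SPARK_HOME is unset or points at Conda while pyspark resolves to Conda's
# site-packages, the notebook is not using the standalone Spark install at all
print(os.environ.get("SPARK_HOME"))
print(pyspark.__file__)
print(pyspark.__version__)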

How I have done it:

Step 1: ssh to the master node (in my case it is the same VM as the IPython kernel/server in GCP)

Step 2:

  • cd ~
  • sudo nano .bashrc
  • scroll down to the last line and paste the line below
  • export SPARK_HOME="/home/your/path/to/spark-2.1.1-bin-hadoop2.7/"
  • Ctrl+X, then Y, then Enter to save the changes (a quick verification sketch follows this list)
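
As a sanity check (note that the notebook server has to be restarted from a new shell so it inherits the updated .bashrc; sourcing it in a terminal does not affect an already running Jupyter/IPython process), a fresh notebook cell should now report the standalone install:

import os

# after restarting the notebook server, this should print the path set in .bashrc,
# e.g. /home/your/path/to/spark-2.1.1-bin-hadoop2.7/
print(os.environ.get("SPARK_HOME"))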

Note: I have also added a few more details to the environment section for clarity.
