
How to submit a python wordcount on HDInsight Spark cluster from Jupyter

I am trying to run a python wordcount on a Spark HDInsight cluster, and I'm running it from Jupyter. I'm not actually sure if this is the right way to do it, but I couldn't find anything helpful about how to submit a standalone python app on an HDInsight Spark cluster.

The code:

import pyspark
import operator
from pyspark import SparkConf
from pyspark import SparkContext
import atexit
from operator import add
conf = SparkConf().setMaster("yarn-client").setAppName("WC")
sc = SparkContext(conf = conf)
atexit.register(lambda: sc.stop())

input = sc.textFile("wasb:///example/data/gutenberg/davinci.txt")
words = input.flatMap(lambda x: x.split())
wordCount = words.map(lambda x: (str(x),1)).reduceByKey(add)

wordCount.saveAsTextFile("wasb:///example/outputspark")

And the error message I get and don't understand:

ValueError                                Traceback (most recent call last)
<ipython-input-2-8a9d4f2cb5e8> in <module>()
      6 from operator import add
      7 import atexit
----> 8 sc = SparkContext('yarn-client')
      9 
     10 input = sc.textFile("wasb:///example/data/gutenberg/davinci.txt")

/usr/hdp/current/spark-client/python/pyspark/context.pyc in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    108         """
    109         self._callsite = first_spark_call() or CallSite(None, None, None)
--> 110         SparkContext._ensure_initialized(self, gateway=gateway)
    111         try:
    112             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,

/usr/hdp/current/spark-client/python/pyspark/context.pyc in _ensure_initialized(cls, instance, gateway)
    248                         " created by %s at %s:%s "
    249                         % (currentAppName, currentMaster,
--> 250                             callsite.function, callsite.file, callsite.linenum))
    251                 else:
    252                     SparkContext._active_spark_context = instance

ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=pyspark-shell, master=yarn-client) created by __init__ at <ipython-input-1-86beedbc8a46>:7 

Is it actually possible to run a python job this way? If yes, it seems to be a problem with the SparkContext definition... I tried different ways:

sc = SparkContext('spark://headnodehost:7077', 'pyspark')

and

conf = SparkConf().setMaster("yarn-client").setAppName("WordCount1")
sc = SparkContext(conf = conf)

but no success. What would be the right way to run the job or configure the SparkContext?

If you are running from a Jupyter notebook, the Spark context is pre-created for you, and it would be incorrect to create a separate one. To resolve the problem, just remove the lines that create the context and start directly from:

input = sc.textFile("wasb:///example/data/gutenberg/davinci.txt")

If you need to run a standalone program, you can run it from the command line using pyspark, or submit it via the REST API of the Livy server running on the cluster.
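For illustration, here is a minimal sketch of the Livy route, assuming the script has already been uploaded to the cluster's storage; the endpoint URL, credentials and script path below are placeholders rather than values from the question:

import json
import requests

# Assumed values: replace the cluster name, login credentials and script location
LIVY_URL = "https://<clustername>.azurehdinsight.net/livy/batches"
payload = {"file": "wasb:///example/app/wordcount.py"}  # script already in cluster storage
headers = {"Content-Type": "application/json", "X-Requested-By": "admin"}

# Submit the batch job; Livy responds with the batch id and its state
response = requests.post(LIVY_URL, data=json.dumps(payload),
                         headers=headers, auth=("admin", "<cluster-login-password>"))
print(response.json())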

It looks like I can answer my question myself. Some changes in the code turned out to be helpful:

import atexit
from operator import add
from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.setMaster("yarn-client")
conf.setAppName("pyspark-word-count6")
sc = SparkContext(conf=conf)
# Stop the context cleanly when the interpreter exits
atexit.register(lambda: sc.stop())

# Read the sample text from the cluster's default WASB storage
data = sc.textFile("wasb:///example/data/gutenberg/davinci.txt")
words = data.flatMap(lambda x: x.split())
# Encode words to plain ASCII before counting to avoid unicode issues in the output
wordCount = words.map(lambda x: (x.encode('ascii','ignore'),1)).reduceByKey(add)

wordCount.saveAsTextFile("wasb:///output/path")

I just resolved a similar bug in my code and found it came down to the fact that pyspark only allows one SparkContext() object. Once it has been created, any change to the code that tries to create another one will hit that problem and raise the same initialisation error. My solution was simply to restart the notebook kernel and, once it had restarted, rerun my notebook script. It then ran without error.
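As a side note, newer PySpark versions also expose SparkContext.getOrCreate, which reuses an already-running context instead of failing; a minimal sketch, assuming that method is available in the cluster's Spark version:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("yarn-client").setAppName("WC")
# Returns the existing SparkContext if one is already running, otherwise creates a new one
sc = SparkContext.getOrCreate(conf)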
