Importing Spark in a Jupyter Notebook
I was following this guide and found that it used a variable named sc, which I figured is the Spark library.
I tried to install Spark using this guide, though I'm not sure if it installed correctly. Now when I try to import PySpark in the notebook, it's not recognized.
I'm on Windows, what should I do?
"I'm not sure if [Spark is] installed correctly ... on windows"
“我不确定 在Windows上是否正确安装了 [Spark] 。”
What happens if you run pyspark.cmd? It should display something like this, and wait for you to enter commands:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/

Using Python version 3.6.0 (default, Dec 23 2016 12:22:00)
SparkSession available as 'spark'.
>>>
At this point you can type spark and/or sc to check whether you have an implicit SparkSession (V2 only) and/or SparkContext (deprecated in V2 but still there for compatibility) object(s).
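A quick sanity check along these lines (a minimal sketch; the exact representations printed depend on your Spark build) should succeed without raising NameError:

spark           # the implicit SparkSession created by the shell
sc              # the implicit SparkContext created by the shell
spark.version   # e.g. '2.1.1'
sc.master       # e.g. 'local[*]' when running locally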
But if you see a swarm of error messages and no sc object has been created, then you must enable verbose logging by adding this property at the end of your log4j.properties file...
log4j.logger.org.apache.spark.repl.Main=DEBUG
...and enjoy your next 5 days of debugging.
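If you are not sure where that file lives: by default Spark reads it from its conf directory, typically %SPARK_HOME%\conf\log4j.properties on Windows (if only log4j.properties.template is present, copy it to log4j.properties first), so the end of the file would look like:

# assumed default location: %SPARK_HOME%\conf\log4j.properties
log4j.logger.org.apache.spark.repl.Main=DEBUG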
Also, read carefully https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html
"when I try to import PySpark in the notebook, it's not recognized"
The proper way to "define" PySpark in Jupyter is to create a kernel configuration file, in one of the default locations (cf. http://jupyter-client.readthedocs.io/en/latest/kernels.html).
To define a new kernel, create a sub-directory (the name is not important) containing a file named kernel.json (that exact name) that looks like...
{ "display_name": "PySpark 2.1.1",
"language": "python",
"argv": [
"/usr/local/bin/python3",
"-m", "ipykernel",
"-f", "{connection_file}" ],
"env": {
"SPARK_HOME": "/opt/spark/spark-2.1.1-bin-hadoop2.7",
"PYTHONPATH": "/opt/spark/spark-2.1.1-bin-hadoop2.7/python/:/opt/spark/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip",
"PYTHONSTARTUP": "/opt/spark/spark-2.1.1-bin-hadoop2.7/python/pyspark/shell.py",
"PYSPARK_SUBMIT_ARGS": "pyspark-shell" }
}
That's an example taken from a Linux box; now it's your job to adapt the paths to your actual Windows install, and adapt the Py4J version if required.
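Purely as an illustration (every path, the kernel directory name, and the Py4J version below are assumptions to replace with whatever your own install actually uses), a Windows adaptation saved as e.g. %APPDATA%\jupyter\kernels\pyspark2\kernel.json might look like:

{
  "display_name": "PySpark 2.1.1 (Windows)",
  "language": "python",
  "argv": [
    "C:/Python36/python.exe",
    "-m", "ipykernel",
    "-f", "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "C:/spark/spark-2.1.1-bin-hadoop2.7",
    "PYTHONPATH": "C:/spark/spark-2.1.1-bin-hadoop2.7/python;C:/spark/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip",
    "PYTHONSTARTUP": "C:/spark/spark-2.1.1-bin-hadoop2.7/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "pyspark-shell"
  }
}

Note that on Windows the PYTHONPATH separator is ";" rather than ":".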
Note that you can stuff additional parameters in PYSPARK_SUBMIT_ARGS, to override spark-defaults.conf, e.g.
"PYSPARK_SUBMIT_ARGS": "--master local[4] --conf spark.driver.memory=4G --conf spark.python.worker.memory=512M --conf spark.rdd.compress=true --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.local.dir=C:/TEMP/spark --conf spark.driver.extraClassPath=C:/path/to/myjdbcdriver.jar pyspark-shell"