
Why does PySpark not find spark-submit when creating a SparkSession?

I'm trying to initialize a PySpark cluster with a Jupyter Notebook on my local machine running Linux Mint. I am following this tutorial. When I try to create a SparkSession, I get an error that spark-submit does not exist. Strangely, this is the same error I get when I try to get the version of spark-shell without including sudo.

spark1 = SparkSession.builder.appName('Test').getOrCreate()

FileNotFoundError: [Errno 2] No such file or directory: '~/Spark/spark-3.1.2-bin-hadoop3.2/./bin/spark-submit'

The correct path for spark-submit is '~/Spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit' (without the extra ./, but the former path should still be valid, right?)

I don't know where Spark is getting this path from, so I don't know where to correct it.
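As far as I can tell, the /./ in the path is harmless; my guess is that the literal ~ is what breaks, since tilde expansion is a shell feature and does not happen when a program hands the path straight to the OS. A quick check in a terminal (the paths are just my layout):

# A literal '~' inside quotes or inside a variable is not expanded by the shell:
$ ls '~/Spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit'
ls: cannot access '~/Spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit': No such file or directory
$ ls ~/Spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit
/home/<user>/Spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit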

As mentioned, I cannot even get the version of spark-shell without including sudo:

~/Spark/spark-3.1.2-bin-hadoop3.2/bin$ ./spark-shell --version
./spark-shell: line 60: ~/Spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit: No such file or directory

~/Spark/spark-3.1.2-bin-hadoop3.2/bin$ ls | grep spark-submit
spark-submit
spark-submit2.cmd
spark-submit.cmd

~/Spark/spark-3.1.2-bin-hadoop3.2/bin$ sudo ./spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/
                        
Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 11.0.11
Branch HEAD
Compiled by user centos on 2021-05-24T04:27:48Z
Revision de351e30a90dd988b133b3d00fa6218bfcaba8b8
Url https://github.com/apache/spark
Type --help for more information.

I tried granting read, write, and execute permissions on all of the files in ~/Spark, with no effect. Could this be related to Java permissions?
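One guess about why sudo behaves differently: sudo's default env_reset drops SPARK_HOME from the environment, so the find-spark-home script next to spark-shell recomputes it from the script's own location instead of using whatever my .bashrc exported. I have not confirmed this, but it would fit what I see:

# Output illustrative; checking whether sudo strips SPARK_HOME:
$ env | grep SPARK_HOME
SPARK_HOME=~/Spark/spark-3.1.2-bin-hadoop3.2
$ sudo env | grep SPARK_HOME
# (no output here would mean sudo reset the environment, letting find-spark-home
#  fall back to the real install directory)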

My .bashrc looks like this:

export SPARK_HOME='~/Spark/spark-3.1.2-bin-hadoop3.2'
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
export PATH=$SPARK_HOME:$PATH:~/.local/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
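For reference, here is the same .bashrc with the tilde spelled out via $HOME (the single quotes above keep the ~ from ever being expanded) and with $SPARK_HOME/bin on the PATH. This is a sketch of what I would try, not something confirmed to fix it:

# Sketch: absolute SPARK_HOME, expanded by bash at export time
export SPARK_HOME="$HOME/Spark/spark-3.1.2-bin-hadoop3.2"
export PYTHONPATH="$SPARK_HOME/python:$PYTHONPATH"
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
# bin/ added so spark-submit and spark-shell resolve from PATH
export PATH="$SPARK_HOME/bin:$PATH:$HOME/.local/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin"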

I'm using Python 3.8 and Apache Spark 3.1.2 pre-built for Hadoop 3.2. My Java version is OpenJDK 11.

Edit: After reinstalling (without modifying the permissions), the files in ~/Spark/spark-3.1.2-bin-hadoop3.2/bin/ are:

$ ls -al ~/Spark/spark-3.1.2-bin-hadoop3.2/bin
total 124
drwxr-xr-x  2 squid squid  4096 May 23 21:45 .
drwxr-xr-x 13 squid squid  4096 May 23 21:45 ..
-rwxr-xr-x  1 squid squid  1089 May 23 21:45 beeline
-rw-r--r--  1 squid squid  1064 May 23 21:45 beeline.cmd
-rwxr-xr-x  1 squid squid 10965 May 23 21:45 docker-image-tool.sh
-rwxr-xr-x  1 squid squid  1935 May 23 21:45 find-spark-home
-rw-r--r--  1 squid squid  2685 May 23 21:45 find-spark-home.cmd
-rw-r--r--  1 squid squid  2337 May 23 21:45 load-spark-env.cmd
-rw-r--r--  1 squid squid  2435 May 23 21:45 load-spark-env.sh
-rwxr-xr-x  1 squid squid  2634 May 23 21:45 pyspark
-rw-r--r--  1 squid squid  1540 May 23 21:45 pyspark2.cmd
-rw-r--r--  1 squid squid  1170 May 23 21:45 pyspark.cmd
-rwxr-xr-x  1 squid squid  1030 May 23 21:45 run-example
-rw-r--r--  1 squid squid  1223 May 23 21:45 run-example.cmd
-rwxr-xr-x  1 squid squid  3539 May 23 21:45 spark-class
-rwxr-xr-x  1 squid squid  2812 May 23 21:45 spark-class2.cmd
-rw-r--r--  1 squid squid  1180 May 23 21:45 spark-class.cmd
-rwxr-xr-x  1 squid squid  1039 May 23 21:45 sparkR
-rw-r--r--  1 squid squid  1097 May 23 21:45 sparkR2.cmd
-rw-r--r--  1 squid squid  1168 May 23 21:45 sparkR.cmd
-rwxr-xr-x  1 squid squid  3122 May 23 21:45 spark-shell
-rw-r--r--  1 squid squid  1818 May 23 21:45 spark-shell2.cmd
-rw-r--r--  1 squid squid  1178 May 23 21:45 spark-shell.cmd
-rwxr-xr-x  1 squid squid  1065 May 23 21:45 spark-sql
-rw-r--r--  1 squid squid  1118 May 23 21:45 spark-sql2.cmd
-rw-r--r--  1 squid squid  1173 May 23 21:45 spark-sql.cmd
-rwxr-xr-x  1 squid squid  1040 May 23 21:45 spark-submit
-rw-r--r--  1 squid squid  1155 May 23 21:45 spark-submit2.cmd
-rw-r--r--  1 squid squid  1180 May 23 21:45 spark-submit.cmd

Why does 'squid' have ownership of all these files? Can you set the user/group ownership to the user that actually runs these submits and therefore needs all of the environment variables defined in .bashrc?
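If ownership were really the problem, something along these lines would hand the whole tree to the login user (generic sketch; adjust user and group as needed):

# Give the Spark install to the current login user and their primary group, then verify:
$ sudo chown -R "$USER":"$(id -gn)" ~/Spark/spark-3.1.2-bin-hadoop3.2
$ ls -ld ~/Spark/spark-3.1.2-bin-hadoop3.2/bin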

This might sound really stupid, but I was having exactly the same problem using a custom PySpark kernel for Jupyter Notebook. What solved it was changing the "~" in the Spark path to "/home/{user}". Here's what my kernel looks like:

{
"display_name": "PySpark",
"language": "python",
"argv": [
    "/usr/bin/python3",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
],
"env": {
    "SPARK_HOME": "/home/rafael/spark-3.2.1-bin-hadoop3.2/",
    "PYTHONPATH": "/home/rafael/spark-3.2.1-bin-hadoop3.2/python/:~/spark-3.2.1-bin-hadoop3.2/python/lib/py4j-0.10.9.3-src.zip",
    "PYTHONSTARTUP": "/home/rafael/spark-3.2.1-bin-hadoop3.2/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master local[*] --conf spark.executor.cores=1 --conf spark.executor.memory=512m pyspark-shell"
}

}
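In case it helps, this is roughly how such a per-user kernel spec gets installed; the kernel name "pyspark" and the file location are just my setup:

# Put the spec where Jupyter looks for per-user kernels, then check it shows up:
$ mkdir -p ~/.local/share/jupyter/kernels/pyspark
$ cp kernel.json ~/.local/share/jupyter/kernels/pyspark/kernel.json
$ jupyter kernelspec list   # the "pyspark" kernel should now be listed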
