Error downloading JARs related to Delta Lake from PySpark packages
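I am trying to set up a local development environment in Docker using PySpark and Delta Lake.
I have gone through the version compatibility between Delta Lake and Spark.
My Pipfile (I use pipenv) contains the following: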
pyspark = {version = "==3.2.2", index = "artifactory-pypi"}
delta-spark={version= "==2.0.0", index = "artifactory-pypi"}
pytest = {version = "==7.1.2", index = "artifactory-pypi"}
pytest-cov ={version= "==3.0.0", index = "artifactory-pypi"}
...other packages
artifactory-pypi is a mirror of PyPI.
I then tried to set up a Python project for unit testing. The code already has this:
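from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Enable the Delta Lake SQL extension and catalog on a single-threaded local session.
_builder = (
    SparkSession.builder.master("local[1]")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
# configure_spark_with_delta_pip adds the delta-core Maven package to the builder.
spark: SparkSession = configure_spark_with_delta_pip(_builder).getOrCreate()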
When I run the unit tests with pytest, it always fails at
configure_spark_with_delta_pip(_builder).getOrCreate()
with an error saying it cannot connect to the Maven repo to download
delta-core_2.12;2.0.0
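I am not very familiar with Java, but after some digging I found that the folder /usr/local/lib/python3.9/site-packages/pyspark/jars/ contains an Ivy jar (ivy-2.4.0.jar, according to the stack trace below). Ivy apparently works out which jars are needed and tries to fetch them by their Maven coordinates. That connection is refused because I am behind a corporate proxy. I tried setting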
ENV HTTP_PROXY=proxyurl
ENV HTTPS_PROXY=proxyurl
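According to an SO post, Maven does not respect the http_proxy environment variables, and the answer there suggests that the OP set the proxy in certain config files. But in that case the OP was using a Maven image, so those conf files already existed. I have no such files or folders, because I am only using a Python image. It is just that, behind the scenes, the Python image goes and downloads jars.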
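For context, a JVM usually takes its proxy from Java system properties such as http.proxyHost and https.proxyHost rather than from HTTP_PROXY. The sketch below is only an illustration of how a proxy URL maps onto those standard properties. The helper name jvm_proxy_options and the proxy address are hypothetical, and this is not necessarily what the linked post does.

```python
from urllib.parse import urlsplit


def jvm_proxy_options(proxy_url: str) -> str:
    """Turn a proxy URL into explicit JVM proxy system properties.

    Hypothetical helper: it only builds a string and does not touch Spark.
    """
    parts = urlsplit(proxy_url)
    if parts.hostname is None:
        raise ValueError(f"no host found in proxy URL: {proxy_url!r}")

    options = []
    # The JVM has separate properties for plain HTTP and for HTTPS targets.
    for scheme in ("http", "https"):
        options.append(f"-D{scheme}.proxyHost={parts.hostname}")
        if parts.port is not None:
            options.append(f"-D{scheme}.proxyPort={parts.port}")
    return " ".join(options)


# Placeholder proxy address, not a real one.
print(jvm_proxy_options("http://proxy.example.com:3128"))
```

A string like this could be passed as spark.driver.extraJavaOptions, assuming the dependency resolution runs in the driver JVM.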
I also looked at the Spark runtime environment properties, in particular spark.jars.repositories, to see whether I could set them in my docker-compose.yml, but even that did not work.
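One way such a repository setting could be passed from the Python side is the PYSPARK_SUBMIT_ARGS environment variable, using the spark-submit flags --repositories and --packages. This is a minimal sketch under assumptions: the mirror URL is hypothetical, and it only helps if that mirror serves Maven artifacts and is reachable from the container.

```python
import os


def pyspark_submit_args(repositories, packages):
    """Build a PYSPARK_SUBMIT_ARGS value.

    The value has to end with 'pyspark-shell' for PySpark to start its JVM.
    """
    return " ".join(
        [
            "--repositories", ",".join(repositories),
            "--packages", ",".join(packages),
            "pyspark-shell",
        ]
    )


# Hypothetical internal Maven mirror; replace with a real, reachable one.
MIRROR = "https://artifactory.example.com/artifactory/maven-remote"

args = pyspark_submit_args([MIRROR], ["io.delta:delta-core_2.12:2.0.0"])
# Must be set before the first SparkSession is created in the process.
os.environ["PYSPARK_SUBMIT_ARGS"] = args
print(args)
```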
How can I make this work? Can someone please advise?
By the way, here is the full stack trace:
:: loading settings :: url = jar:file:/usr/local/lib/python3.9/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-63bca6fa-4932-4c6d-b13f-4a339629fc26;1.0
confs: [default]
:: resolution report :: resolve 84422ms :: artifacts dl 0ms
:: modules in use:
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 1 | 0 | 0 | 0 || 0 | 0 |
---------------------------------------------------------------------
:: problems summary ::
:::: WARNINGS
module not found: io.delta#delta-core_2.12;2.0.0
==== local-m2-cache: tried
file:/root/.m2/repository/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.pom
-- artifact io.delta#delta-core_2.12;2.0.0!delta-core_2.12.jar:
file:/root/.m2/repository/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.jar
==== local-ivy-cache: tried
/root/.ivy2/local/io.delta/delta-core_2.12/2.0.0/ivys/ivy.xml
-- artifact io.delta#delta-core_2.12;2.0.0!delta-core_2.12.jar:
/root/.ivy2/local/io.delta/delta-core_2.12/2.0.0/jars/delta-core_2.12.jar
==== central: tried
https://repo1.maven.org/maven2/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.pom
-- artifact io.delta#delta-core_2.12;2.0.0!delta-core_2.12.jar:
https://repo1.maven.org/maven2/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.jar
==== spark-packages: tried
https://repos.spark-packages.org/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.pom
-- artifact io.delta#delta-core_2.12;2.0.0!delta-core_2.12.jar:
https://repos.spark-packages.org/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.jar
::::::::::::::::::::::::::::::::::::::::::::::
:: UNRESOLVED DEPENDENCIES ::
::::::::::::::::::::::::::::::::::::::::::::::
:: io.delta#delta-core_2.12;2.0.0: not found
::::::::::::::::::::::::::::::::::::::::::::::
:::: ERRORS
Server access error at url https://repo1.maven.org/maven2/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.pom (java.net.ConnectException: Connection refused (Connection refused))
Server access error at url https://repo1.maven.org/maven2/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.jar (java.net.ConnectException: Connection refused (Connection refused))
Server access error at url https://repos.spark-packages.org/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.pom (java.net.ConnectException: Connection refused (Connection refused))
Server access error at url https://repos.spark-packages.org/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.jar (java.net.ConnectException: Connection refused (Connection refused))
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: io.delta#delta-core_2.12;2.0.0: not found]
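The URLs Ivy tries in the trace above follow the usual Maven repository layout (group path, artifact, version, then artifact-version.extension). As a rough sketch, the helper below rebuilds those URLs from the coordinate, which can help when checking from inside the container whether a given repository is reachable. The function name maven_artifact_url is made up for this illustration.

```python
def maven_artifact_url(repo_root: str, coordinate: str, extension: str = "jar") -> str:
    """Build the download URL for a 'group:artifact:version' coordinate."""
    group, artifact, version = coordinate.split(":")
    # Dots in the group id become path separators in the repository layout.
    group_path = group.replace(".", "/")
    return (
        f"{repo_root.rstrip('/')}/{group_path}/{artifact}/{version}/"
        f"{artifact}-{version}.{extension}"
    )


coordinate = "io.delta:delta-core_2.12:2.0.0"
for ext in ("pom", "jar"):
    print(maven_artifact_url("https://repo1.maven.org/maven2", coordinate, ext))
```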
If anything is unclear, I can try to elaborate.
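Thanks to the SO community, I have been able to solve this, basically by combining information from multiple SO posts and answers.
Here is how. As mentioned in the question, the proxy variables were already set in the Dockerfile: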
ENV HTTP_PROXY=proxyurl
ENV HTTPS_PROXY=proxyurl
So, combining all of this, here is what I did in the Dockerfile:
ENV HTTP_PROXY=proxyurl
ENV HTTPS_PROXY=proxyurl
WORKDIR /usr/local/lib/python3.9/site-packages/pyspark
RUN mkdir conf
WORKDIR /usr/local/lib/python3.9/site-packages/pyspark/conf
RUN echo "spark.driver.extraJavaOptions=-Djava.net.useSystemProxies=true" > spark-defaults.conf
ENV SPARK_HOME=/usr/local/lib/python3.9/site-packages/pyspark
ENV SPARK_PYTHON=python3
# Other app-specific steps
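For anyone who prefers to prepare the config from a script rather than with Dockerfile RUN steps, the same file can be written from Python. This is a minimal sketch, not part of the original fix: write_spark_defaults is a hypothetical helper, and the demo writes into a temporary directory instead of the real pyspark folder.

```python
import tempfile
from pathlib import Path


def write_spark_defaults(spark_home: Path, java_options: str) -> Path:
    """Create <spark_home>/conf/spark-defaults.conf with the driver JVM options.

    Like the `echo ... >` step in the Dockerfile, this overwrites an existing file.
    """
    conf_dir = spark_home / "conf"
    conf_dir.mkdir(parents=True, exist_ok=True)
    conf_file = conf_dir / "spark-defaults.conf"
    conf_file.write_text(f"spark.driver.extraJavaOptions={java_options}\n")
    return conf_file


# Demo against a throwaway directory standing in for SPARK_HOME.
with tempfile.TemporaryDirectory() as tmp:
    path = write_spark_defaults(Path(tmp), "-Djava.net.useSystemProxies=true")
    content = path.read_text()
    print(content, end="")
```

In the real image, spark_home would be the same directory that SPARK_HOME points to, so that Spark picks up the conf folder.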