Error downloading Jars related to Delta Lake from pyspark packages
I am trying to set up a local dev environment in Docker with pyspark and Delta Lake.
I have gone through the version compatibility matrix between Delta Lake and Spark here.
I have the following in my Pipfile (using pipenv):
pyspark = {version = "==3.2.2", index = "artifactory-pypi"}
delta-spark={version= "==2.0.0", index = "artifactory-pypi"}
pytest = {version = "==7.1.2", index = "artifactory-pypi"}
pytest-cov ={version= "==3.0.0", index = "artifactory-pypi"}
...other packages
The artifactory-pypi index is a mirror of PyPI.
I have gone through this and am trying to set up a Python project for unit testing. The code already has this:
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

_builder = (
    SparkSession.builder.master("local[1]")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
spark: SparkSession = configure_spark_with_delta_pip(_builder).getOrCreate()
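From what I can tell, configure_spark_with_delta_pip essentially appends the matching delta-core Maven coordinates to spark.jars.packages, so the jar download is triggered at getOrCreate(). Roughly the equivalent of (a sketch; the coordinates are taken from the stack trace below):

_builder = _builder.config(
    "spark.jars.packages", "io.delta:delta-core_2.12:2.0.0"
)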
When I try to run my unit tests using pytest, it always fails at

configure_spark_with_delta_pip(_builder).getOrCreate()

with an error saying that it can't connect to the Maven repo to download

delta-core_2.12;2.0.0
I am not very well versed in Java, but I have done some digging and found that within the

/usr/local/lib/python3.9/site-packages/pyspark/jars/

folder there is an ivy-2.4.0.jar, which apparently knows which jars are needed and reaches out to their Maven coordinates. This connection is refused because I am behind a corporate proxy. I have tried setting
ENV HTTP_PROXY=proxyurl
ENV HTTPS_PROXY=proxyurl
According to this SO post, Maven does not honor the http_proxy environment variable, and the answer suggested that the OP set the proxy in some configuration files. But there the OP was using a Maven image and thus already had those conf files. I do not have such files or folders, as I am just using a Python image; it is just that the Python packages go and download the jars behind the scenes.
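For what it's worth, the Maven/Ivy equivalent of those environment variables are the standard JVM proxy system properties. A sketch of passing them to the driver JVM through the builder (this is my understanding, not something from the posts above; proxyhost and 8080 are placeholders for the corporate proxy):

_builder = _builder.config(
    "spark.driver.extraJavaOptions",
    "-Dhttp.proxyHost=proxyhost -Dhttp.proxyPort=8080 "
    "-Dhttps.proxyHost=proxyhost -Dhttps.proxyPort=8080",
)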
I have also tried looking at the Spark runtime environment variables, especially spark.jars.repositories, to see if I could set them in my docker-compose.yml, but even that didn't work.
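The idea there was to point the jar resolver at our internal Artifactory mirror instead of Maven Central, which could in principle also be done straight from the builder (a sketch; the URL is a placeholder for the internal Maven remote):

_builder = _builder.config(
    "spark.jars.repositories",
    "https://artifactory.example.corp/maven-remote",
)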
How can I get this to work? Can someone suggest a fix?
Btw, here is the full stack trace:
:: loading settings :: url = jar:file:/usr/local/lib/python3.9/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-63bca6fa-4932-4c6d-b13f-4a339629fc26;1.0
confs: [default]
:: resolution report :: resolve 84422ms :: artifacts dl 0ms
:: modules in use:
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 1 | 0 | 0 | 0 || 0 | 0 |
---------------------------------------------------------------------
:: problems summary ::
:::: WARNINGS
module not found: io.delta#delta-core_2.12;2.0.0
==== local-m2-cache: tried
file:/root/.m2/repository/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.pom
-- artifact io.delta#delta-core_2.12;2.0.0!delta-core_2.12.jar:
file:/root/.m2/repository/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.jar
==== local-ivy-cache: tried
/root/.ivy2/local/io.delta/delta-core_2.12/2.0.0/ivys/ivy.xml
-- artifact io.delta#delta-core_2.12;2.0.0!delta-core_2.12.jar:
/root/.ivy2/local/io.delta/delta-core_2.12/2.0.0/jars/delta-core_2.12.jar
==== central: tried
https://repo1.maven.org/maven2/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.pom
-- artifact io.delta#delta-core_2.12;2.0.0!delta-core_2.12.jar:
https://repo1.maven.org/maven2/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.jar
==== spark-packages: tried
https://repos.spark-packages.org/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.pom
-- artifact io.delta#delta-core_2.12;2.0.0!delta-core_2.12.jar:
https://repos.spark-packages.org/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.jar
::::::::::::::::::::::::::::::::::::::::::::::
:: UNRESOLVED DEPENDENCIES ::
::::::::::::::::::::::::::::::::::::::::::::::
:: io.delta#delta-core_2.12;2.0.0: not found
::::::::::::::::::::::::::::::::::::::::::::::
:::: ERRORS
Server access error at url https://repo1.maven.org/maven2/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.pom (java.net.ConnectException: Connection refused (Connection refused))
Server access error at url https://repo1.maven.org/maven2/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.jar (java.net.ConnectException: Connection refused (Connection refused))
Server access error at url https://repos.spark-packages.org/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.pom (java.net.ConnectException: Connection refused (Connection refused))
Server access error at url https://repos.spark-packages.org/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.jar (java.net.ConnectException: Connection refused (Connection refused))
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: io.delta#delta-core_2.12;2.0.0: not found]
If you have any questions, I can try to elaborate.
Thanks to the SO community, I have been able to solve this, basically by combining information from multiple SO posts and answers. So, this is how.

The proxy environment variables stay in place:

ENV HTTP_PROXY=proxyurl
ENV HTTPS_PROXY=proxyurl

but on their own they are not enough, since the JVM that resolves the jars ignores them; the JVM also has to be told to use the system proxies via java.net.useSystemProxies. Combining all of that, this is how I did it in my Dockerfile:
ENV HTTP_PROXY=proxyurl
ENV HTTPS_PROXY=proxyurl

# Create a conf dir inside the pip-installed pyspark and add a
# spark-defaults.conf that tells the driver JVM to use the system proxies,
# so the Ivy resolve can get through the corporate proxy.
WORKDIR /usr/local/lib/python3.9/site-packages/pyspark
RUN mkdir conf
WORKDIR /usr/local/lib/python3.9/site-packages/pyspark/conf
RUN echo "spark.driver.extraJavaOptions=-Djava.net.useSystemProxies=true" > spark-defaults.conf

# Point SPARK_HOME at the pip-installed pyspark so the conf dir is picked up.
ENV SPARK_HOME=/usr/local/lib/python3.9/site-packages/pyspark
ENV PYSPARK_PYTHON=python3
# ... other app-specific stuff
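Once the image is built, a quick smoke test inside the container confirms that Ivy can now resolve the jar through the proxy (a minimal sketch; /tmp/delta-smoke-test is just a scratch path):

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder.master("local[1]")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
# getOrCreate() is where the delta-core jar gets resolved; if the proxy
# settings took effect, this no longer throws the ConnectException.
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Writing and reading one tiny Delta table proves the jars are on the classpath.
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-smoke-test")
print(spark.read.format("delta").load("/tmp/delta-smoke-test").count())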