
Error downloading Jars related to Delta Lake from pyspark packages

I am trying to set up a local dev environment in Docker with PySpark and Delta Lake.

I have checked the version compatibility between Delta Lake and Spark here.

I have the following in my Pipfile (using pipenv):

pyspark = {version = "==3.2.2", index = "artifactory-pypi"}
delta-spark={version= "==2.0.0", index = "artifactory-pypi"}
pytest = {version = "==7.1.2", index = "artifactory-pypi"}
pytest-cov ={version= "==3.0.0", index = "artifactory-pypi"}
...other packages

The artifactory-pypi index is a mirror of PyPI.

I have gone through this and am trying to set up a Python project for unit testing. The code already has this:

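from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Enable the Delta Lake SQL extension and catalog on a single-threaded local session.
_builder = (
    SparkSession.builder.master("local[1]")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
# configure_spark_with_delta_pip adds the delta-core package to spark.jars.packages.
spark: SparkSession = configure_spark_with_delta_pip(_builder).getOrCreate()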

When I try to run my unit tests using pytest, it always fails at

configure_spark_with_delta_pip(_builder).getOrCreate()

with an error that it can't connect to the Maven repo to download

delta-core_2.12;2.0.0

I am not very well versed in Java, but I have done some digging and found that within the

/usr/local/lib/python3.9/site-packages/pyspark/jars/

folder there is an ivy2.jar file, which apparently has information on which jars are needed and tries to reach out to the Maven coordinates. The connection is refused because I am behind a corporate proxy. I have tried setting

ENV HTTP_PROXY=proxyurl
ENV HTTPS_PROXY=proxyurl

According to this SO post, Maven does not honor the http_proxy environment variable, and the answer suggested that the OP set the proxy in some configuration files. But there the OP was using a Maven image and thus already had those conf files. I do not have such files or folders, as I am just using a Python image. It is just that, behind the scenes, the Python packages go and download jars.
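For context, the JVM generally takes proxy settings from Java system properties such as http.proxyHost and https.proxyPort rather than from HTTP_PROXY. A minimal sketch of translating a proxy URL into those properties follows; the function name and the proxy address are hypothetical, and the assumption is that the resulting string would be passed to the JVM, for example through Spark's spark.driver.extraJavaOptions property.

```python
from urllib.parse import urlsplit


def jvm_proxy_options(proxy_url: str) -> str:
    """Translate a proxy URL into standard Java networking system properties."""
    parts = urlsplit(proxy_url)
    if not parts.hostname:
        raise ValueError(f"cannot parse proxy URL: {proxy_url!r}")
    # Fall back to the scheme's default port when the URL omits one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    options = []
    # Use the same proxy for both plain HTTP and HTTPS traffic.
    for scheme in ("http", "https"):
        options.append(f"-D{scheme}.proxyHost={parts.hostname}")
        options.append(f"-D{scheme}.proxyPort={port}")
    return " ".join(options)


if __name__ == "__main__":
    # Placeholder address, not a value taken from the question.
    print(jvm_proxy_options("http://proxy.example.internal:3128"))
```

Whether a given proxy also needs credentials or a non-proxy host list is not covered here.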

I have also tried looking at the Spark runtime environment variables, especially spark.jars.repositories, to see if I could set them in my docker-compose.yml, but even that didn't work.
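One possible reason: spark.jars.repositories is a Spark configuration property rather than an environment variable, so an environment entry in docker-compose.yml would not be expected to reach Spark. A rough sketch of setting it on the session builder instead is below; the helper names and the mirror URL are hypothetical, and it assumes the mirror proxies Maven Central.

```python
def extra_repository_conf(repository_url: str) -> dict:
    """Spark properties that add one extra Maven repository for package lookup."""
    return {"spark.jars.repositories": repository_url}


def apply_conf(builder, conf: dict):
    """Apply each key/value pair to any object with a SparkSession.builder-style .config()."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


if __name__ == "__main__":
    # Hypothetical internal mirror URL; replace with the real repository.
    conf = extra_repository_conf("https://artifactory.example.internal/maven-remote")
    print(conf)
    # With pyspark installed, the builder from the question could be passed through:
    #   _builder = apply_conf(_builder, conf)
```

The default resolvers (Maven Central, spark-packages) may still be tried as well, so resolution could remain slow even if the mirror is reachable.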

How can I get this to work? Can someone suggest either:

  1. Is it possible to route this download through an org Artifactory? If so, where do I configure it? E.g. for all my Python packages, I am using a PyPI mirror.
  2. I can also download the jars manually, but how and where do I copy them, and which environment variables (e.g. PATH) do I need to set to make it work?
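For the manual route, the jar location follows the standard Maven repository layout, which is also what the "central" URLs in the stack trace show. The sketch below builds that URL from a coordinate; the function name and the idea of pointing repo_root at an internal mirror are assumptions, not something confirmed in this thread.

```python
def maven_jar_url(coordinate: str,
                  repo_root: str = "https://repo1.maven.org/maven2") -> str:
    """Build the jar URL for a 'group:artifact:version' Maven coordinate."""
    group, artifact, version = coordinate.split(":")
    # Dots in the group id become directory separators in the repository layout.
    group_path = group.replace(".", "/")
    return (
        f"{repo_root.rstrip('/')}/{group_path}/{artifact}/{version}/"
        f"{artifact}-{version}.jar"
    )


if __name__ == "__main__":
    print(maven_jar_url("io.delta:delta-core_2.12:2.0.0"))
```

A jar fetched this way would presumably go into the pyspark/jars folder mentioned above. Any transitive dependencies would need the same treatment, and configure_spark_with_delta_pip would likely have to be skipped, since it sets spark.jars.packages and so triggers the Ivy lookup.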

Btw, here is the full stack trace:

:: loading settings :: url = jar:file:/usr/local/lib/python3.9/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-63bca6fa-4932-4c6d-b13f-4a339629fc26;1.0
        confs: [default]
:: resolution report :: resolve 84422ms :: artifacts dl 0ms
        :: modules in use:
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   1   |   0   |   0   |   0   ||   0   |   0   |
        ---------------------------------------------------------------------

:: problems summary ::
:::: WARNINGS
                module not found: io.delta#delta-core_2.12;2.0.0

        ==== local-m2-cache: tried

          file:/root/.m2/repository/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.pom

          -- artifact io.delta#delta-core_2.12;2.0.0!delta-core_2.12.jar:

          file:/root/.m2/repository/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.jar

        ==== local-ivy-cache: tried

          /root/.ivy2/local/io.delta/delta-core_2.12/2.0.0/ivys/ivy.xml

          -- artifact io.delta#delta-core_2.12;2.0.0!delta-core_2.12.jar:

          /root/.ivy2/local/io.delta/delta-core_2.12/2.0.0/jars/delta-core_2.12.jar

        ==== central: tried

          https://repo1.maven.org/maven2/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.pom

          -- artifact io.delta#delta-core_2.12;2.0.0!delta-core_2.12.jar:

          https://repo1.maven.org/maven2/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.jar

        ==== spark-packages: tried

          https://repos.spark-packages.org/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.pom

          -- artifact io.delta#delta-core_2.12;2.0.0!delta-core_2.12.jar:

          https://repos.spark-packages.org/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.jar

                ::::::::::::::::::::::::::::::::::::::::::::::

                ::          UNRESOLVED DEPENDENCIES         ::

                ::::::::::::::::::::::::::::::::::::::::::::::

                :: io.delta#delta-core_2.12;2.0.0: not found

                ::::::::::::::::::::::::::::::::::::::::::::::


:::: ERRORS
        Server access error at url https://repo1.maven.org/maven2/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.pom (java.net.ConnectException: Connection refused (Connection refused))

        Server access error at url https://repo1.maven.org/maven2/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.jar (java.net.ConnectException: Connection refused (Connection refused))

        Server access error at url https://repos.spark-packages.org/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.pom (java.net.ConnectException: Connection refused (Connection refused))

        Server access error at url https://repos.spark-packages.org/io/delta/delta-core_2.12/2.0.0/delta-core_2.12-2.0.0.jar (java.net.ConnectException: Connection refused (Connection refused))


:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: io.delta#delta-core_2.12;2.0.0: not found]

If anything is unclear, I can try to elaborate.

Thanks to the SO community, I have been able to solve this, basically by combining information from multiple SO posts and answers.

So, this is how:

  1. As I mentioned in my question, I already had the proxy environment variables set:
ENV HTTP_PROXY=proxyurl
ENV HTTPS_PROXY=proxyurl
  2. Then I followed this post, especially the answer by Thomas Decaux.
  3. Then this post about where to find the spark-defaults.conf file when you install pyspark through pip.
  4. Then this post about setting the relevant environment variables.

So, combining all of them, this is how I did it in my Dockerfile:

ENV HTTP_PROXY=proxyurl
ENV HTTPS_PROXY=proxyurl
WORKDIR /usr/local/lib/python3.9/site-packages/pyspark
RUN mkdir conf
WORKDIR /usr/local/lib/python3.9/site-packages/pyspark/conf
RUN echo "spark.driver.extraJavaOptions=-Djava.net.useSystemProxies=true" > spark-defaults.conf
ENV SPARK_HOME=/usr/local/lib/python3.9/site-packages/pyspark
ENV PYSPARK_PYTHON=python3

...other app-specific steps
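The same conf file could also be generated from Python, for instance in a test setup step, instead of with RUN echo. This is only a sketch: the function name is hypothetical, and it assumes SPARK_HOME points at the resulting directory before the first SparkSession is created.

```python
import tempfile
from pathlib import Path


def write_spark_defaults(spark_home: str, properties: dict) -> Path:
    """Write key=value pairs to <spark_home>/conf/spark-defaults.conf."""
    conf_dir = Path(spark_home) / "conf"
    # A pip-installed pyspark has no conf directory by default, so create it.
    conf_dir.mkdir(parents=True, exist_ok=True)
    conf_file = conf_dir / "spark-defaults.conf"
    lines = [f"{key}={value}" for key, value in properties.items()]
    conf_file.write_text("\n".join(lines) + "\n")
    return conf_file


if __name__ == "__main__":
    # A temporary directory stands in for the real pyspark install path.
    with tempfile.TemporaryDirectory() as fake_home:
        path = write_spark_defaults(
            fake_home,
            {"spark.driver.extraJavaOptions": "-Djava.net.useSystemProxies=true"},
        )
        print(path.read_text())
```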
