Azure Databricks PySpark custom UDF ModuleNotFoundError: No module named
I was checking this SO question, but none of its solutions helped: PySpark custom UDF ModuleNotFoundError: No module named.
I have the following repo structure on Azure Databricks:
|-run_pipeline.py
|-__init__.py
|-data_science
|--__init.py__
|--text_cleaning
|---text_cleaning.py
|---__init.py__
On the run_pipeline notebook I have this:
import os
import sys

from pyspark.sql import SparkSession

# The parent directory must be on sys.path before the import can succeed
path = os.path.join(os.path.dirname(__file__), os.pardir)
sys.path.append(path)

from data_science.text_cleaning import text_cleaning

spark = SparkSession.builder.master(
    "local[*]").appName('workflow').getOrCreate()

df = text_cleaning.basic_clean(spark_df)
On text_cleaning.py I have a function called basic_clean that will run something like this:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def basic_clean(df):
    print('Removing links')
    udf_remove_links = udf(_remove_links, StringType())
    df = df.withColumn("cleaned_message", udf_remove_links("cleaned_message"))
    return df
When I do df.show() on the run_pipeline notebook, I get this error message:
Exception has occurred: PythonException (note: full exception trace is shown but execution is paused at: <module>)
An exception was thrown from a UDF: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length
return self.loads(obj)
File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'data_science''. Full traceback below:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length
return self.loads(obj)
File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'data_science'
Shouldn't the imports work? Why is this an issue?
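For context on why this fails: the UDF is pickled on the driver and unpickled on the executors, so the executors also need data_science on their Python path, not just the notebook. One common way to ship a local package to executors is to zip it and register the archive with SparkContext.addPyFile. A sketch (the helper name is mine, and the addPyFile call is shown only in the docstring because it needs a live SparkSession):

```python
import os
import shutil

def package_module(module_dir, out_dir):
    """Zip a local package directory so the archive root contains the package.

    The resulting zip can then be shipped to Spark executors with:
        spark.sparkContext.addPyFile(zip_path)
    after which `import data_science...` resolves inside UDFs.
    """
    base = os.path.join(out_dir, os.path.basename(module_dir))
    return shutil.make_archive(
        base, "zip",
        root_dir=os.path.dirname(module_dir),   # archive from the parent dir
        base_dir=os.path.basename(module_dir),  # include only the package
    )
```

Archiving from the parent directory matters: the zip must contain data_science/... at its root for the import to resolve on the executors.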
It seems the data_science module is missing on the cluster. Consider installing it on the cluster. Please check the link below about installing libraries on a cluster: https://learn.microsoft.com/en-us/azure/databricks/libraries/cluster-libraries
You can run the pip list command to see which libraries are installed on the cluster.
You can also run pip install data_science directly in a notebook cell.
I've been facing the same issue running PySpark tests with UDFs in Azure DevOps. I've noticed that this happens when running from the pool with vmImage: ubuntu-latest. When I use a custom container built from the following Dockerfile, the tests run fine:
FROM python:3.8.3-slim-buster AS py3
FROM openjdk:8-slim-buster
ENV PYSPARK_VER=3.3.0
ENV DELTASPARK_VER=2.1.0
COPY --from=py3 / /
WORKDIR /setup
COPY requirements.txt .
RUN python -m pip install --upgrade pip
RUN pip install --no-cache-dir -r requirements.txt && \
rm requirements.txt
WORKDIR /code
requirements.txt contains pyspark==3.3.0 and delta-spark==2.1.0.
This led me to conclude that it's due to how Spark runs in the default Ubuntu VM, which runs Python 3.10.6 and Java 11 (at the time of posting). I've tried setting environment variables such as PYSPARK_PYTHON to force PySpark to use the same Python binary on which the package under test is installed, but to no avail.
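The environment-variable attempt described above can be sketched as follows; these variables must be set before the SparkSession is created (and, per this answer, they did not fix the problem in that CI setup):

```python
import os
import sys

# Point both the workers and the driver at the interpreter that has the
# package under test installed. Must run before SparkSession.builder...
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
```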
Maybe you can use this information to find a way to make it work on the default agent pool's Ubuntu VM; otherwise I recommend just using a pre-configured container like I did.