Azure Databricks PySpark custom UDF ModuleNotFoundError: No module named
I was checking this SO question, but none of its solutions helped: PySpark custom UDF ModuleNotFoundError: No module named.
I have the following repo structure on Azure Databricks:
|-run_pipeline.py
|-__init__.py
|-data_science
|--__init.py__
|--text_cleaning
|---text_cleaning.py
|---__init.py__
On the run_pipeline notebook I have this:
import os
import sys

from pyspark.sql import SparkSession

# The parent directory must be on sys.path before the import can succeed
path = os.path.join(os.path.dirname(__file__), os.pardir)
sys.path.append(path)

from data_science.text_cleaning import text_cleaning

spark = SparkSession.builder.master(
    "local[*]").appName('workflow').getOrCreate()

df = text_cleaning.basic_clean(spark_df)
On text_cleaning.py I have a function called basic_clean that will run something like this:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def basic_clean(df):
    print('Removing links')
    udf_remove_links = udf(_remove_links, StringType())
    df = df.withColumn("cleaned_message", udf_remove_links("cleaned_message"))
    return df
When I do df.show() on the run_pipeline notebook, I get this error message:
Exception has occurred: PythonException (note: full exception trace is shown but execution is paused at: <module>)
An exception was thrown from a UDF: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length
return self.loads(obj)
File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'data_science''. Full traceback below:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length
return self.loads(obj)
File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'data_science'
Shouldn't the imports work? Why is this an issue?
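For context on why this fails: the UDF is pickled on the driver and unpickled on the executors, so the executors also need data_science on their Python path, not just the notebook. One common way to ship a local package to executors is to zip it and register the archive with SparkContext.addPyFile. A sketch (the helper name is mine, and the addPyFile call is shown only in the docstring because it needs a live SparkSession):

```python
import os
import shutil

def package_module(module_dir, out_dir):
    """Zip a local package directory so the archive root contains the package.

    The resulting zip can then be shipped to Spark executors with:
        spark.sparkContext.addPyFile(zip_path)
    after which `import data_science...` resolves inside UDFs.
    """
    base = os.path.join(out_dir, os.path.basename(module_dir))
    return shutil.make_archive(
        base, "zip",
        root_dir=os.path.dirname(module_dir),   # archive from the parent dir
        base_dir=os.path.basename(module_dir),  # include only the package
    )
```

Archiving from the parent directory matters: the zip must contain data_science/... at its root for the import to resolve on the executors.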
It seems the data_science module is missing on the cluster. Consider installing it on the cluster. Please check the link below about installing libraries on a cluster: https://learn.microsoft.com/en-us/azure/databricks/libraries/cluster-libraries
You can run the pip list command to see which libraries are installed on the cluster.
You can also run pip install data_science directly in a notebook cell.
I've been facing the same issue running PySpark tests with UDFs in Azure DevOps. I've noticed that this happens when running from the pool with vmImage: ubuntu-latest. When I use a custom container built from the following Dockerfile, the tests run fine:
FROM python:3.8.3-slim-buster AS py3
FROM openjdk:8-slim-buster
ENV PYSPARK_VER=3.3.0
ENV DELTASPARK_VER=2.1.0
COPY --from=py3 / /
WORKDIR /setup
COPY requirements.txt .
RUN python -m pip install --upgrade pip
RUN pip install --no-cache-dir -r requirements.txt && \
rm requirements.txt
WORKDIR /code
requirements.txt contains pyspark==3.3.0 and delta-spark==2.1.0.
This led me to conclude that it's due to how Spark runs in the default Ubuntu VM, which runs Python 3.10.6 and Java 11 (at the time of posting). I've tried setting environment variables such as PYSPARK_PYTHON to force PySpark to use the same Python binary on which the package under test is installed, but to no avail.
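The environment-variable attempt described above can be sketched as follows; these variables must be set before the SparkSession is created (and, per this answer, they did not fix the problem in that CI setup):

```python
import os
import sys

# Point both the workers and the driver at the interpreter that has the
# package under test installed. Must run before SparkSession.builder...
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
```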
Maybe you can use this information to find a way to make it work on the default agent pool's Ubuntu VM; otherwise I recommend just using a pre-configured container like I did.