I was checking this SO question, but none of its solutions helped: PySpark custom UDF ModuleNotFoundError: No module named
I have the following repo on Azure Databricks:
|-run_pipeline.py
|-__init__.py
|-data_science
|--__init__.py
|--text_cleaning
|---text_cleaning.py
|---__init__.py
In the run_pipeline notebook I have this (the sys.path entry has to be appended before the import, and the imports made explicit):
import os
import sys
sys.path.append(os.path.join(os.path.dirname(__file__), os.pardir))
from pyspark.sql import SparkSession
from data_science.text_cleaning import text_cleaning
spark = SparkSession.builder.master(
    "local[*]").appName('workflow').getOrCreate()
df = text_cleaning.basic_clean(spark_df)
In text_cleaning.py I have a function called basic_clean that runs something like this:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def basic_clean(df):
    print('Removing links')
    udf_remove_links = udf(_remove_links, StringType())
    df = df.withColumn("cleaned_message", udf_remove_links("cleaned_message"))
    return df
When I call df.show() in the run_pipeline notebook, I get this error message:
Exception has occurred: PythonException (note: full exception trace is shown but execution is paused at: <module>)
An exception was thrown from a UDF: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length
return self.loads(obj)
File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'data_science''. Full traceback below:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length
return self.loads(obj)
File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'data_science'
Shouldn't the imports work? Why is this an issue?
It seems the data_science module is missing on the cluster. Consider installing it on the cluster; see this link about installing libraries on a cluster: https://learn.microsoft.com/en-us/azure/databricks/libraries/cluster-libraries
You can run the pip list command to see which libraries are installed on the cluster, and you can also run pip install data_science directly in a notebook cell.
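For context on why the error surfaces inside the UDF rather than at import time: pickle serializes a module-level function by reference (module name plus qualified name), not by embedding its code, so each Spark worker must be able to import that module when it unpickles the UDF. A minimal, Spark-free sketch of this mechanism (function name and body are illustrative only):

```python
import pickle

# Pickle stores only a reference to a module-level function: the module
# name and the qualified name. Unpickling re-imports the module, so if
# the module is not importable on the unpickling side (e.g. a Spark
# worker), pickle.loads raises ModuleNotFoundError.
def clean(text):
    return text.lower()

payload = pickle.dumps(clean)
restored = pickle.loads(payload)  # succeeds here: the module is importable
```

On a Databricks worker the data_science package is not on sys.path, so the equivalent loads call fails with the ModuleNotFoundError shown in the question.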
I've been facing the same issue running PySpark tests with UDFs in Azure DevOps. I've noticed that it happens when running from the pool with vmImage: ubuntu-latest. When I use a custom container built from the following Dockerfile, the tests run fine:
FROM python:3.8.3-slim-buster AS py3
FROM openjdk:8-slim-buster
ENV PYSPARK_VER=3.3.0
ENV DELTASPARK_VER=2.1.0
COPY --from=py3 / /
WORKDIR /setup
COPY requirements.txt .
RUN python -m pip install --upgrade pip
RUN pip install --no-cache-dir -r requirements.txt && \
rm requirements.txt
WORKDIR /code
requirements.txt contains pyspark==3.3.0 and delta-spark==2.1.0.
This led me to conclude that it's due to how Spark runs on the default Ubuntu VM, which ships Python 3.10.6 and Java 11 (at the time of posting). I've tried setting environment variables such as PYSPARK_PYTHON to force PySpark to use the same Python binary on which the package under test is installed, but to no avail.
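For concreteness, the attempt described above amounts to something like the following (a sketch of my workaround attempt, done before the SparkSession is created; it did not fix the problem in my case):

```python
import os
import sys

# Point both driver and worker processes at the interpreter that has the
# package under test installed. PYSPARK_PYTHON controls the workers;
# PYSPARK_DRIVER_PYTHON controls the driver.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

# The SparkSession would then be built as usual, e.g.:
# spark = SparkSession.builder.master("local[*]").appName("tests").getOrCreate()
```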
Maybe you can use this information to find a way to make it work on the default agent pool's Ubuntu VM; otherwise I recommend just using a pre-configured container like I did.