
What is the best practice to install IsolationForest on the Databricks platform for the PySpark API?

I'm trying to install the Isolation Forest package on the Databricks platform. The Spark version in Databricks is 3.1.1:

print(pyspark.__version__)
# 3.1.1

So I tried to follow this article to implement IsolationForest, but I couldn't install the package from this repo with the following steps:

Step 1. Package spark-iforest jar and deploy it into spark lib

cd spark-iforest/

mvn clean package -DskipTests

cp target/spark-iforest-.jar $SPARK_HOME/jars/

Step 2. Package pyspark-iforest and install it via pip; skip this step if you don't need the Python package

cd spark-iforest/python

python setup.py sdist

pip install dist/pyspark-iforest-.tar.gz

So basically I run the following script and get: ModuleNotFoundError: No module named 'pyspark_iforest'

from pyspark import SparkConf
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark_iforest.ml.iforest import IForest, IForestModel
import tempfile

conf = SparkConf()
conf.set('spark.jars', '/full/path/to/spark-iforest-2.4.0.jar')

spark = SparkSession \
        .builder \
        .config(conf=conf) \
        .appName("IForestExample") \
        .getOrCreate()

temp_path = tempfile.mkdtemp()
iforest_path = temp_path + "/iforest"
model_path = temp_path + "/iforest_model"
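One quick way to narrow down a ModuleNotFoundError like this is to check whether the package is visible to the Python interpreter the notebook is actually using (on Databricks, the pip environment can differ per cluster or notebook). A minimal sketch using only the standard library:

```python
import importlib.util

def is_installed(module_name):
    # find_spec returns None when the module cannot be located
    # by the current interpreter's import machinery
    return importlib.util.find_spec(module_name) is not None

# prints False if the package never reached this interpreter's environment
print(is_installed("pyspark_iforest"))
```

If this prints False, the pip install went into a different environment than the one running the notebook, independent of any jar/classpath issues.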


This specific version of isolation forest is compiled for Spark 2.4 and Scala 2.11, and is binary incompatible with the Spark 3.1 that you're using. You could try a Databricks Runtime (DBR) version that is based on Spark 2.4, such as DBR 6.4 or 5.4.

You may look into the mmlspark (Microsoft Machine Learning for Apache Spark) library developed by Microsoft; it has an implementation of IsolationForest, although I haven't used it myself.
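For reference, a rough sketch of how that library's isolation forest might be invoked. This assumes the SynapseML package (the current name of mmlspark) is attached to the cluster, e.g. via its Maven coordinates; the class and setter names below follow its published API but are not verified against your runtime, and `df` stands in for your input DataFrame:

```python
# Hedged sketch: assumes the SynapseML (formerly mmlspark) package is
# installed on the cluster, e.g. Maven coordinates
# com.microsoft.azure:synapseml_2.12:<version>.
from pyspark.ml.feature import VectorAssembler
from synapse.ml.isolationforest import IsolationForest

# assemble raw columns into a single feature vector
# (column names f1/f2 are placeholders for your data)
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
features = assembler.transform(df)

iforest = (IsolationForest()
           .setFeaturesCol("features")
           .setPredictionCol("predictedLabel")
           .setScoreCol("outlierScore")
           .setNumEstimators(100)
           .setContamination(0.02))

model = iforest.fit(features)
scored = model.transform(features)  # adds score and prediction columns
```

Since SynapseML ships as a cluster library rather than a local jar, it sidesteps the manual `spark.jars` setup from the question entirely.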

You need to load the jar into the Spark session first, before importing the Python module that wraps it:

from pyspark import SparkConf
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler, StandardScaler
import tempfile

conf = SparkConf()
conf.set('spark.jars', '/full/path/to/spark-iforest-2.4.0.jar')

spark = SparkSession \
        .builder \
        .config(conf=conf) \
        .appName("IForestExample") \
        .getOrCreate()


from pyspark_iforest.ml.iforest import IForest, IForestModel

temp_path = tempfile.mkdtemp()
iforest_path = temp_path + "/iforest"
model_path = temp_path + "/iforest_model"

I usually create the spark session in a separate .py file, or provide the spark.jars via the spark-submit command, because the way jars are loaded sometimes gives me trouble when they are added within the code only.

spark-submit --jars /full/path/to/spark-iforest-2.4.0.jar my_code.py

Also, there is a version mismatch, as @Alex Ott mentioned, but the error would be different in that case. Building IForest with PySpark 3.x is not very difficult, but if you don't want to get into it, you could downgrade the PySpark version.
