
How to install Spark dependencies in Google AI Platform in Google Cloud Platform

In Google Colab, the current directory is /myContent, and that directory has the following content:

setup.py      spark-2.4.5-bin-hadoop2.7.tgz     trainer/

The trainer folder contains __init__.py and task.py . task.py holds my Python code, which includes import pyspark
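For context, task.py has roughly the following shape. This is only an illustrative sketch (the argument names match the job submission below, but the actual training logic is omitted):

# Illustrative sketch of trainer/task.py -- the real training code is not shown here.
import argparse

import pyspark
from pyspark.sql import SparkSession


def main():
    # Hypothetical arguments; --param_A / --param_B mirror the job submission below.
    parser = argparse.ArgumentParser()
    parser.add_argument('--param_A', type=int, default=1)
    parser.add_argument('--param_B', type=int, default=2)
    parser.add_argument('--job-dir', default=None)
    args, _ = parser.parse_known_args()

    # "import pyspark" must succeed here, which is what this question is about.
    spark = SparkSession.builder.appName('trainer').getOrCreate()
    df = spark.createDataFrame([(args.param_A, args.param_B)], ['a', 'b'])
    df.show()
    spark.stop()


if __name__ == '__main__':
    main()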

Here is the code snippet in setup.py which installs the Spark dependency file:

from setuptools import find_packages
from setuptools import setup

REQUIRED_PACKAGES = ['spark-2.4.5-bin-hadoop2.7.tgz']

setup(
    name='trainer',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    include_package_data=True,
    description='My training application package.'
)

When I submit a training job to Google AI Platform by running the command below from the /myContent directory:

!gcloud ai-platform jobs submit training $JOB_NAME \
    --package-path $PACKAGE_PATH \
    --module-name $MODULE \
    --staging-bucket $STAGING_PATH \
    --scale-tier custom \
    --master-machine-type complex_model_l_gpu \
    --worker-machine-type complex_model_l_gpu \
    --worker-count 2 \
    --runtime-version 2.1 \
    --python-version 3.7 \
    --packages spark-2.4.5-bin-hadoop2.7.tgz \
    --job-dir $JOB_DIR \
    -- \
    --param_A=1 \
    --param_B=2

The job fails with an error message from the logs:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.7/tokenize.py", line 447, in open
    buffer = _builtin_open(filename, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-req-build-b_xhvahl/setup.py'

1) I have submitted setup.py to Google AI Platform; why can't it find that .py file?

2) How can I install the Spark dependency file on Google AI Platform beforehand? In a Google Colab Jupyter notebook, I always run the following code in a cell:

# install spark
%cd
!apt-get install openjdk-8-jdk-headless -qq
!wget https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar -xvf spark-2.4.5-bin-hadoop2.7.tgz > /dev/null
!pip install -q findspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/root/spark-2.4.5-bin-hadoop2.7"

Thank you

  1. I have submitted setup.py to Google AI Platform; why can't it find that .py file?

I replicated the same error, and the issue is most probably caused by the .tgz file, since only .tar.gz and .whl archives are supported; see manual build and adding custom dependencies. In setup.py you are referencing an archive file (spark-2.4.5-bin-hadoop2.7.tgz), but as far as I know, install_requires should list PyPI package names (or a directory containing the required dependencies).

  2. How can I install the Spark dependency file on Google AI Platform beforehand? In a Google Colab Jupyter notebook, I always run the following code in a cell:

In the Jupyter cell you are extracting the .tgz file and installing the Spark binaries, then pointing the SPARK_HOME variable at them. This is different from the procedure driven by setup.py . I noticed that the Spark documentation says "PySpark is now available in pypi. To install just run pip install pyspark"; so, in order to use import pyspark, you can opt to install it by:

  • Using the file pyspark-3.0.0.tar.gz instead of spark-2.4.5-bin-hadoop2.7.tgz.

  • Specifying pyspark in the setup.py file, for example install_requires=['pyspark>=2.4.5'], and following the guidelines to properly configure the setup.py file (see the sketch after this list).
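A minimal sketch of a setup.py following the second option, based on the file in the question (the version pin is illustrative):

# Sketch of a corrected setup.py -- pulls pyspark from PyPI instead of bundling a .tgz archive.
from setuptools import find_packages
from setuptools import setup

# PyPI package names (optionally with version specifiers), not archive file names.
REQUIRED_PACKAGES = ['pyspark>=2.4.5']

setup(
    name='trainer',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    include_package_data=True,
    description='My training application package.'
)

With this, the --packages spark-2.4.5-bin-hadoop2.7.tgz flag is no longer needed, since AI Platform installs the install_requires dependencies when it builds the trainer package.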


 