pandas version is not updated after installing a new version on databricks

Question

I am trying to fix a pandas problem that occurs when I run Python 3.7 code on Databricks.

The error is:

 ImportError: cannot import name 'roperator' from 'pandas.core.ops' (/databricks/python/lib/python3.7/site-packages/pandas/core/ops.py)

the pandas version:

pd.__version__
0.24.2

The following import

 from pandas.core.ops import roperator

works fine on my laptop with

pandas 0.25.1

So, I tried to upgrade pandas on databricks.

%sh pip uninstall -y pandas
Successfully uninstalled pandas-1.1.2

%sh pip install pandas==0.25.1
 Collecting pandas==0.25.1
 Downloading pandas-0.25.1-cp37-cp37m-manylinux1_x86_64.whl (10.4 MB)
 Requirement already satisfied: python-dateutil>=2.6.1 in /databricks/conda/envs/databricks-ml/lib/python3.7/site-packages (from pandas==0.25.1) (2.8.0)
 Requirement already satisfied: numpy>=1.13.3 in /databricks/conda/envs/databricks-ml/lib/python3.7/site-packages (from pandas==0.25.1) (1.16.2)
 Requirement already satisfied: pytz>=2017.2 in /databricks/conda/envs/databricks-ml/lib/python3.7/site-packages (from pandas==0.25.1) (2018.9)
 Requirement already satisfied: six>=1.5 in /databricks/conda/envs/databricks-ml/lib/python3.7/site-packages (from python-dateutil>=2.6.1->pandas==0.25.1) (1.12.0)
 Installing collected packages: pandas
 ERROR: After October 2020 you may experience errors when installing or updating packages. 
  This is because pip will change the way that it resolves dependency conflicts.

  We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

  mlflow 1.8.0 requires alembic, which is not installed.
  mlflow 1.8.0 requires prometheus-flask-exporter, which is not installed.
  mlflow 1.8.0 requires sqlalchemy<=1.3.13, which is not installed.
  sklearn-pandas 2.0.1 requires numpy>=1.18.1, but you'll have numpy 1.16.2 which is incompatible.
   sklearn-pandas 2.0.1 requires pandas>=1.0.5, but you'll have pandas 0.25.1 which is incompatible.
   sklearn-pandas 2.0.1 requires scikit-learn>=0.23.0, but you'll have scikit-learn 0.20.3 which is incompatible.
   sklearn-pandas 2.0.1 requires scipy>=1.4.1, but you'll have scipy 1.2.1 which is incompatible.
   Successfully installed pandas-0.25.1

When I run:

 import pandas as pd
 pd.__version__

it is still:

 0.24.2

Did I miss something?

Thanks.

Answer 1

It is recommended to install libraries via a cluster initialization script. The %sh command is executed only on the driver node, not on the executor nodes, and it does not affect a Python instance that is already running.
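To see which copy of pandas the running interpreter actually uses, it can help to print the interpreter path next to the module's file path and version. In the question, the traceback points at /databricks/python/lib/python3.7/site-packages while pip reports /databricks/conda/envs/databricks-ml/lib/python3.7/site-packages, so comparing these paths may show whether the notebook and the %sh shell are looking at the same place. The sketch below is only a diagnostic; the helper name describe_module is hypothetical and not part of Databricks or pandas.

```python
import importlib
import sys


def describe_module(name):
    """Report the running interpreter and where it loads `name` from."""
    module = importlib.import_module(name)
    return {
        # Interpreter behind the current notebook or session
        "executable": sys.executable,
        # File the module was actually loaded from
        "file": getattr(module, "__file__", None),
        # None if the module does not expose __version__
        "version": getattr(module, "__version__", None),
    }


if __name__ == "__main__":
    # "pandas" is the package from the question; any importable name works.
    try:
        for key, value in describe_module("pandas").items():
            print(f"{key}: {value}")
    except ImportError:
        print("pandas is not installed in this interpreter")
```

If the reported version does not change after a %sh pip install, the running Python process is most likely still holding the module it imported earlier, which matches the behaviour described above.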

The correct solution is to use the dbutils.library commands, like this:

dbutils.library.installPyPI("pandas", "1.0.1")
dbutils.library.restartPython()

This installs the library on the driver and on the executor nodes, but Python must be restarted to pick up the new library.

Also, although it is possible to specify only the package name, it is recommended to specify the version explicitly, as some library versions may not be compatible with the runtime. Also consider using a newer runtime where the library versions are already updated; check the release notes for the runtimes to see which library versions are installed out of the box.
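Since the version matters here, a small guard at the top of the notebook can fail early with a clear message instead of an ImportError deep inside an import. The sketch below is a minimal, dependency-free comparison of dotted version strings; the helper names version_tuple and meets_minimum are hypothetical, and the threshold 0.25.1 is simply the version the question reports as working, not a documented minimum.

```python
import re


def version_tuple(version):
    """Turn '0.25.1' into (0, 25, 1).

    Keeps the leading digits of each dot-separated component and stops
    at the first component that has none (for example 'dev0').
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, required):
    """Return True if `installed` is at least `required`."""
    a, b = version_tuple(installed), version_tuple(required)
    width = max(len(a), len(b))
    # Pad with zeros so that '0.25' compares equal to '0.25.0'
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


print(meets_minimum("0.24.2", "0.25.1"))  # False
print(meets_minimum("1.0.1", "0.25.1"))   # True
```

In a notebook this could be used as `assert meets_minimum(pd.__version__, "0.25.1")` right after importing pandas. It ignores pre-release ordering, so it is a rough check rather than a full version parser.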

For newer Databricks runtimes, you can use the new magic commands %pip and %conda to install dependencies. See the documentation for more details.

solution1
2 ACCPTED 2020-09-10 09:19:16
