
AWS EMR pandas conflict with numpy in pyspark after bootstrapping

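After starting a cluster with the bootstrap code below and getting the stdout shown further down, I get the following error when I try to import pandas in pyspark. It is caused by a conflict with a numpy version that does not appear anywhere in that stdout. So it seems pyspark is selectively ignoring the numpy install and using an older version, which causes the conflict. How can I fix this?

The EMR version I am using is emr-5.33.0.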

import pandas as pd
  File "/usr/local/lib64/python3.7/site-packages/pandas/__init__.py", line 22, in <module>
    from pandas.compat import (
  File "/usr/local/lib64/python3.7/site-packages/pandas/compat/__init__.py", line 15, in <module>
    from pandas.compat.numpy import (
  File "/usr/local/lib64/python3.7/site-packages/pandas/compat/numpy/__init__.py", line 21, in <module>
    f"this version of pandas is incompatible with numpy < {_min_numpy_ver}\n"
ImportError: this version of pandas is incompatible with numpy < 1.17.3
your numpy version is 1.16.5.
Please upgrade numpy to >= 1.17.3 to use this pandas version

This is the bootstrap code I am using:

#!/bin/bash
set -x -e

echo -e 'export PYSPARK_PYTHON=/usr/bin/python3
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_JARS_DIR=/usr/lib/spark/jars
export SPARK_HOME=/usr/lib/spark' >> $HOME/.bashrc && source $HOME/.bashrc

sudo python3 -m pip install
sudo python3 -m pip install numpy pandas awscli boto spark-nlp
sudo python3 -m pip freeze
sudo ls /usr/local/lib64/python3.7/site-packages/


set +x
exit 0

This is the software configuration I provided:

[{
  "Classification": "spark-env",
  "Configurations": [{
    "Classification": "export",
    "Properties": {
      "PYSPARK_PYTHON": "/usr/bin/python3"
    }
  }]
},
{
  "Classification": "spark-defaults",
    "Properties": {
      "spark.yarn.stagingDir": "hdfs:///tmp",
      "spark.yarn.preserve.staging.files": "true",
      "spark.kryoserializer.buffer.max": "2000M",
      "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
      "spark.driver.maxResultSize": "0",
      "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.2"
    }
}
]

This is the stdout I get after bootstrapping:

Collecting numpy
  Downloading https://files.pythonhosted.org/packages/2c/d2/8973eb282fc3c7e6c4db0469f0390d81d8eb9ae56dfaa2a7e6db07283682/numpy-1.21.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (14.1MB)
Installing collected packages: numpy
Successfully installed numpy-1.21.0
Collecting pandas
  Downloading https://files.pythonhosted.org/packages/99/f7/01cea7f6c963100f045876eb4aa1817069c5c9eca73d2dbfb5d31ff9a39f/pandas-1.3.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (10.8MB)
Collecting awscli
  Downloading https://files.pythonhosted.org/packages/aa/24/e098cf5ce28a764bca174e88f4ccb70754e9f049c9bf986e582aedcb7420/awscli-1.19.112-py2.py3-none-any.whl (3.6MB)
Requirement already satisfied: boto in /usr/local/lib/python3.7/site-packages
Collecting spark-nlp
  Downloading https://files.pythonhosted.org/packages/6a/98/5e860fdd0227b8eac3907acd5f896c9b2aae0a93cd676aaaf2aa4f48dfe0/spark_nlp-3.1.2-py2.py3-none-any.whl (45kB)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/site-packages (from pandas)
Requirement already satisfied: numpy>=1.17.3 in /root/.local/lib/python3.7/site-packages (from pandas)
Collecting python-dateutil>=2.7.3 (from pandas)
  Downloading https://files.pythonhosted.org/packages/36/7a/87837f39d0296e723bb9b62bbb257d0355c7f6128853c78955f57342a56d/python_dateutil-2.8.2-py2.py3-none-any.whl (247kB)
Collecting rsa<4.8,>=3.1.2; python_version > "2.7" (from awscli)
  Downloading https://files.pythonhosted.org/packages/e9/93/0c0f002031f18b53af7a6166103c02b9c0667be528944137cc954ec921b3/rsa-4.7.2-py3-none-any.whl
Collecting docutils<0.16,>=0.10 (from awscli)
  Downloading https://files.pythonhosted.org/packages/22/cd/a6aa959dca619918ccb55023b4cb151949c64d4d5d55b3f4ffd7eee0c6e8/docutils-0.15.2-py3-none-any.whl (547kB)
Requirement already satisfied: PyYAML<5.5,>=3.10 in /usr/local/lib64/python3.7/site-packages (from awscli)
Collecting s3transfer<0.5.0,>=0.4.0 (from awscli)
  Downloading https://files.pythonhosted.org/packages/63/d0/693477c688348654ddc21dcdce0817653a294aa43f41771084c25e7ff9c7/s3transfer-0.4.2-py2.py3-none-any.whl (79kB)
Collecting colorama<0.4.4,>=0.2.5 (from awscli)
  Downloading https://files.pythonhosted.org/packages/c9/dc/45cdef1b4d119eb96316b3117e6d5708a08029992b2fee2c143c7a0a5cc5/colorama-0.4.3-py2.py3-none-any.whl
Collecting botocore==1.20.112 (from awscli)
  Downloading https://files.pythonhosted.org/packages/c7/ea/11c3beca131920f552602b98d7ba9fc5b46bee6a59cbd48a95a85cbb8f41/botocore-1.20.112-py2.py3-none-any.whl (7.7MB)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas)
Collecting pyasn1>=0.1.3 (from rsa<4.8,>=3.1.2; python_version > "2.7"->awscli)
  Downloading https://files.pythonhosted.org/packages/62/1e/a94a8d635fa3ce4cfc7f506003548d0a2447ae76fd5ca53932970fe3053f/pyasn1-0.4.8-py2.py3-none-any.whl (77kB)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.7/site-packages (from botocore==1.20.112->awscli)
Collecting urllib3<1.27,>=1.25.4 (from botocore==1.20.112->awscli)
  Downloading https://files.pythonhosted.org/packages/5f/64/43575537846896abac0b15c3e5ac678d787a4021e906703f1766bfb8ea11/urllib3-1.26.6-py2.py3-none-any.whl (138kB)
Installing collected packages: python-dateutil, pandas, pyasn1, rsa, docutils, urllib3, botocore, s3transfer, colorama, awscli, spark-nlp
Successfully installed awscli-1.19.112 botocore-1.20.112 colorama-0.4.3 docutils-0.15.2 pandas-1.3.0 pyasn1-0.4.8 python-dateutil-2.8.2 rsa-4.7.2 s3transfer-0.4.2 spark-nlp-3.1.2 urllib3-1.26.6
awscli==1.19.112
beautifulsoup4==4.9.3
boto==2.49.0
botocore==1.20.112
click==7.1.2
colorama==0.4.3
docutils==0.15.2
jmespath==0.10.0
joblib==1.0.1
lxml==4.6.2
mysqlclient==1.4.2
nltk==3.5
nose==1.3.4
numpy==1.21.0
pandas==1.3.0
py-dateutil==2.2
pyasn1==0.4.8
python-dateutil==2.8.2
pytz==2021.1
PyYAML==5.4.1
regex==2021.3.17
rsa==4.7.2
s3transfer==0.4.2
six==1.13.0
spark-nlp==3.1.2
tqdm==4.59.0
urllib3==1.26.6
windmill==1.6
click
click-7.1.2.dist-info
joblib
joblib-1.0.1.dist-info
lxml
lxml-4.6.2-py3.7.egg-info
mysqlclient-1.4.2-py3.7.egg-info
MySQLdb
pandas
pandas-1.3.0.dist-info
PyYAML-5.4.1-py3.7.egg-info
regex
regex-2021.3.17-py3.7.egg-info
tqdm
tqdm-4.59.0.dist-info
yaml
_yaml

This is actually an EMR bug, and it is being discussed on the AWS forums here: https://forums.aws.amazon.com/thread.jspa?messageID=989210&tstart=0
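Before applying a workaround, it can help to confirm which numpy the pyspark interpreter actually resolves. The bootstrap log in the question reports numpy as already satisfied in /root/.local/lib/python3.7/site-packages, and the listing of /usr/local/lib64/python3.7/site-packages shows pandas but no numpy. One plausible reading (an inference, not confirmed in the thread) is that numpy 1.21.0 landed in root's user site, which the user running pyspark does not see. The sketch below is a minimal diagnostic; `describe_module` is a hypothetical helper name, not part of any library.

```python
import importlib
import sys


def describe_module(name):
    """Return (version, file path) for an importable module, or None if missing.

    Hypothetical helper for illustration only.
    """
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    version = getattr(module, "__version__", "unknown")
    path = getattr(module, "__file__", "built-in")
    return version, path


if __name__ == "__main__":
    # Run this with the interpreter that PYSPARK_PYTHON points to,
    # or paste it into the pyspark shell.
    print("interpreter:", sys.executable)
    print("numpy:", describe_module("numpy"))
    # The order of sys.path decides which copy of numpy wins.
    for entry in sys.path:
        print("  ", entry)
```

If the reported path is not the location the bootstrap log installed into, the two installs are living in different site-packages directories. Inside a pyspark session, the same function can be mapped over a small RDD to check the executors as well.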

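I am facing the same issue on emr 6.3.0; my solution was to pin pandas==1.2.5 in the bootstrap script. This is a quick fix until AWS resolves the issue.

In addition, I see some solutions and tricks posted here:

How to install multiple versions of numpy on Amazon EMR, and how to remove earlier versions?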
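The reason pinning helps can be read off the traceback: pandas 1.3.0 rejects any numpy below 1.17.3, and the interpreter is picking up 1.16.5. To my knowledge pandas 1.2.x still accepts numpy 1.16.5, but treat that as an assumption and check the release notes for the version you pin. The sketch below reproduces the comparison with the numbers from the question; `parse_version` and `meets_minimum` are hypothetical helpers, not the check pandas itself performs.

```python
def parse_version(text):
    """Turn a plain dotted version such as '1.16.5' into a tuple of ints.

    Only handles purely numeric versions (no 'rc1' or '.dev0' suffixes).
    """
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, minimum):
    """Return True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    # Versions taken from the traceback and the bootstrap log in the question.
    print("numpy 1.16.5 vs 1.17.3:", meets_minimum("1.16.5", "1.17.3"))
    print("numpy 1.21.0 vs 1.17.3:", meets_minimum("1.21.0", "1.17.3"))
```

Tuples of integers are compared element by element, so 1.16.5 sorts below 1.17.3 even though a plain string comparison of longer version numbers could get this wrong.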

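I ran into the same problem. Basically, I added the install as an EMR step instead of a bootstrap script, and that worked for me. If you are somehow keying off EMR cluster state changes this might not be suitable, but it should unblock a lot of scenarios that do not depend on that. More details here.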
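Moving the install into a step could look roughly like the sketch below, which builds a step definition for `command-runner.jar` and submits it with boto3's `add_job_flow_steps`. The answer does not show its actual step, so the package list, the step name, and the helper names `build_pip_install_step` and `submit_step` are all assumptions. Note also that, as far as I know, EMR steps run on the master node, so this covers driver-side imports and may not update the core and task nodes.

```python
def build_pip_install_step(packages, name="Install Python packages"):
    """Build an EMR step definition that pip-installs the given packages.

    Hypothetical helper; the step name and package list are illustrative.
    """
    if not packages:
        raise ValueError("at least one package is required")
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar executes the Args as a command on the cluster.
            "Jar": "command-runner.jar",
            "Args": ["sudo", "python3", "-m", "pip", "install", *packages],
        },
    }


def submit_step(cluster_id, step, region_name):
    """Submit one step to a running cluster and return the new step IDs."""
    import boto3  # imported lazily so the builder above works without boto3

    emr = boto3.client("emr", region_name=region_name)
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"]


if __name__ == "__main__":
    # Same packages as the bootstrap script in the question.
    step = build_pip_install_step(
        ["numpy", "pandas", "awscli", "boto", "spark-nlp"]
    )
    print(step)
```

Keeping the step definition separate from the submission makes it easy to inspect or reuse the same dictionary in the `Steps` list when the cluster is created.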
