Error importing PyDeequ package on Glue 3.0
I am trying to import the pydeequ library in an AWS environment while building a job with Glue. I put a zip file of pydeequ in the Python library path and the JAR file in the Dependent JARs path. My script is the following:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import pydeequ
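from pydeequ.analyzers import *
from pyspark.sql import SparkSession  # needed for SparkSession.builder below
import findspark
findspark.init()
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
spark = (SparkSession
    .builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())
sc = spark.sparkContext  # reuse the context created by the session instead of constructing a second one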
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
However, the pydeequ import fails with the following error:
2022-12-21 17:50:31,717 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(73)): Error from Python:Traceback (most recent call last):
  File "/tmp/Test_Pydeequ.py", line 7, in <module>
    import pydeequ
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 668, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 638, in _load_backward_compatible
  File "/tmp/pydeequ.zip/pydeequ/__init__.py", line 21, in <module>
    from pydeequ.configs import DEEQU_MAVEN_COORD
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 668, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 638, in _load_backward_compatible
  File "/tmp/pydeequ.zip/pydeequ/configs.py", line 37, in <module>
    DEEQU_MAVEN_COORD = _get_deequ_maven_config()
  File "/tmp/pydeequ.zip/pydeequ/configs.py", line 28, in _get_deequ_maven_config
    spark_version = _get_spark_version()
  File "/tmp/pydeequ.zip/pydeequ/configs.py", line 23, in _get_spark_version
    spark_version = output.stdout.decode().split("\n")[-2]
IndexError: list index out of range
I need to use pydeequ inside the AWS environment and I don't know why this problem occurs.
I would appreciate it very much if someone could help me solve it.
I solved the problem by doing two things:
First, I opened the configs.py file of pydeequ and changed the code in the _get_spark_version() method. The code now looks like this:
@lru_cache(maxsize=None)
def _get_spark_version() -> str:
    # Get version from a subprocess so we don't mess up with existing SparkContexts.
    command = [
        "python",
        "-c",
        "from pyspark import SparkContext; print(SparkContext.getOrCreate()._jsc.version())",
    ]
    output = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    #spark_version = output.stdout.decode().split("\n")[-2]
    spark_version = '3.1.1'
    return spark_version
I simply commented out the original spark_version assignment and hardcoded '3.1.1', which is the Spark version used in Glue 3.0.
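The IndexError in the traceback is consistent with the subprocess writing nothing to stdout: "".split("\n") is [""], which has no element at index -2. Why the subprocess stays silent on Glue is not established in this post. As a less invasive variant of the patch above, here is a minimal sketch that keeps the subprocess lookup and only falls back to a fixed version when no output comes back. The names parse_spark_version, get_spark_version_with_fallback and FALLBACK_SPARK_VERSION are hypothetical and not part of pydeequ.

```python
import subprocess
import sys
from functools import lru_cache

# Assumption: Glue 3.0 runs Spark 3.1.1, as stated in the post.
FALLBACK_SPARK_VERSION = "3.1.1"


def parse_spark_version(stdout_text: str, fallback: str) -> str:
    """Return the last non-empty line of the output, or the fallback if there is none."""
    lines = [line.strip() for line in stdout_text.splitlines() if line.strip()]
    return lines[-1] if lines else fallback


@lru_cache(maxsize=None)
def get_spark_version_with_fallback() -> str:
    """Hypothetical stand-in for pydeequ's _get_spark_version()."""
    command = [
        sys.executable,  # same interpreter as the job, instead of a bare "python"
        "-c",
        "from pyspark import SparkContext; print(SparkContext.getOrCreate()._jsc.version())",
    ]
    try:
        output = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except OSError:
        # The interpreter could not be launched at all.
        return FALLBACK_SPARK_VERSION
    return parse_spark_version(output.stdout.decode(), FALLBACK_SPARK_VERSION)
```

With empty output, parse_spark_version returns the fallback instead of raising. With output that ends in a version line, it returns that line, which is what the original [-2] index was reaching for.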
Second, I was using the wrong version of the Deequ JAR file. It was 1.0.3, which is not compatible with Spark 3.1.1 as used by Glue 3.0, so I had to download the JAR file for Deequ 2.0.0.
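A mismatch like this can be caught before the job starts with a small guard. The sketch below is illustrative only: KNOWN_GOOD_DEEQU and check_deequ_jar are hypothetical names, and the table contains just the one pairing confirmed in this post (Spark 3.1.x with Deequ 2.0.0). Any other entry would need to be checked against the Deequ release notes.

```python
# Only pairing confirmed in this post: Spark 3.1.x worked with Deequ 2.0.0.
# Assumption: further entries must be verified separately before being added.
KNOWN_GOOD_DEEQU = {"3.1": "2.0.0"}


def spark_minor(version: str) -> str:
    """Reduce a full Spark version such as '3.1.1' to 'major.minor'."""
    major, minor = version.split(".")[:2]
    return f"{major}.{minor}"


def check_deequ_jar(spark_version: str, deequ_version: str) -> None:
    """Raise ValueError if the Deequ JAR version is not the known-good one."""
    expected = KNOWN_GOOD_DEEQU.get(spark_minor(spark_version))
    if expected is None:
        raise ValueError(f"No known-good Deequ version recorded for Spark {spark_version}")
    if not deequ_version.startswith(expected):
        raise ValueError(
            f"Deequ {deequ_version} is not the known-good version "
            f"({expected}) for Spark {spark_version}"
        )
```

Calling check_deequ_jar("3.1.1", "2.0.0") returns silently, while check_deequ_jar("3.1.1", "1.0.3") raises ValueError, matching the two cases described above.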
Finally, it works!