
Error importing PyDeequ package on Glue 3.0

I am trying to import the pydeequ library in an AWS environment while building a job with Glue. I put a zip file of pydeequ in the Python library path and the jar file in the Dependent JARs path. My script is the following:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import pydeequ
from pydeequ.analyzers import *
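from pyspark.sql import SparkSession  # needed for SparkSession.builder below
import findspark
findspark.init()

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

# Build a session that requests the deequ package pydeequ selects for this Spark version.
spark = (SparkSession
   .builder
   .config("spark.jars.packages", pydeequ.deequ_maven_coord)
   .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
   .getOrCreate())

# Reuse the context behind the session; calling SparkContext() a second time raises an error.
sc = SparkContext.getOrCreate()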
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

However, the pydeequ import fails with the following error:

2022-12-21 17:50:31,717 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(73)): Error from Python:Traceback (most recent call last):
  File "/tmp/Test_Pydeequ.py", line 7, in <module>
    import pydeequ
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 668, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 638, in _load_backward_compatible
  File "/tmp/pydeequ.zip/pydeequ/__init__.py", line 21, in <module>
    from pydeequ.configs import DEEQU_MAVEN_COORD
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 668, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 638, in _load_backward_compatible
  File "/tmp/pydeequ.zip/pydeequ/configs.py", line 37, in <module>
    DEEQU_MAVEN_COORD = _get_deequ_maven_config()
  File "/tmp/pydeequ.zip/pydeequ/configs.py", line 28, in _get_deequ_maven_config
    spark_version = _get_spark_version()
  File "/tmp/pydeequ.zip/pydeequ/configs.py", line 23, in _get_spark_version
    spark_version = output.stdout.decode().split("\n")[-2]
IndexError: list index out of range

I need to use pydeequ inside the AWS environment, and I don't know why this problem occurs.

I would appreciate it very much if someone could help me solve this problem.

I solved the problem by doing two things:

  1. First step

I had to open pydeequ's configs.py file and change the code in the _get_spark_version() method. The code became:

@lru_cache(maxsize=None)
def _get_spark_version() -> str:
    # Get version from a subprocess so we don't mess up with existing SparkContexts.
    command = [
        "python",
        "-c",
        "from pyspark import SparkContext; print(SparkContext.getOrCreate()._jsc.version())",
    ]
    output = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    #spark_version = output.stdout.decode().split("\n")[-2]
    spark_version = '3.1.1'
    return spark_version

I simply commented out the original spark_version assignment and hardcoded '3.1.1', which is the Spark version used in Glue 3.0.
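For context, the IndexError in the traceback comes from `output.stdout.decode().split("\n")[-2]`: when the subprocess prints nothing to stdout, the split yields a one-element list and index -2 does not exist. A less invasive variant of the patch above might keep the detection and only fall back to the hardcoded value when nothing usable comes back. This is a minimal sketch, not pydeequ's own code: `parse_spark_version`, `get_spark_version`, and `FALLBACK_SPARK_VERSION` are hypothetical names, and using `sys.executable` instead of `"python"` is an assumption about what might make the subprocess more reliable on Glue.

```python
import subprocess
import sys
from functools import lru_cache

# Assumption: Glue 3.0 runs Spark 3.1.1, as stated above.
FALLBACK_SPARK_VERSION = "3.1.1"


def parse_spark_version(stdout: str, fallback: str = FALLBACK_SPARK_VERSION) -> str:
    """Return the last non-empty line of stdout, or the fallback if there is none."""
    lines = [line.strip() for line in stdout.splitlines() if line.strip()]
    return lines[-1] if lines else fallback


@lru_cache(maxsize=None)
def get_spark_version() -> str:
    # Same subprocess approach as the original, so existing SparkContexts are untouched.
    command = [
        sys.executable,  # assumption: the current interpreter rather than "python"
        "-c",
        "from pyspark import SparkContext; print(SparkContext.getOrCreate()._jsc.version())",
    ]
    try:
        output = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except OSError:
        # The interpreter could not be launched at all.
        return FALLBACK_SPARK_VERSION
    return parse_spark_version(output.stdout.decode())
```

With empty stdout, which is the case that triggered the error, `parse_spark_version("")` returns the fallback instead of raising.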

  2. Second step

I was also using the wrong version of the deequ jar file. It was version 1.0.3, which is not compatible with the Spark 3.1.1 used by Glue 3.0, so I had to download the jar file for deequ 2.0.0.

Finally, it works!
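A quick sanity check at the top of the job could catch this kind of mismatch early. The sketch below assumes the deequ jar follows a `deequ-<version>-spark-<major.minor>.jar` naming pattern (for example `deequ-2.0.0-spark-3.1.jar`); that pattern and the helper names `spark_minor` and `jar_matches_spark` are assumptions for illustration, so adjust them to the file you actually uploaded.

```python
def spark_minor(spark_version: str) -> str:
    """Reduce a full Spark version such as '3.1.1' to its 'major.minor' part."""
    major, minor = spark_version.split(".")[:2]
    return f"{major}.{minor}"


def jar_matches_spark(jar_name: str, spark_version: str) -> bool:
    """Check that the jar file name is tagged with the given Spark major.minor.

    Assumption: the jar name ends with '-spark-<major.minor>.jar'.
    """
    return jar_name.endswith(f"-spark-{spark_minor(spark_version)}.jar")


# Hypothetical file names, checked against the Spark version of Glue 3.0.
print(jar_matches_spark("deequ-2.0.0-spark-3.1.jar", "3.1.1"))
print(jar_matches_spark("deequ-1.0.3.jar", "3.1.1"))
```

The check only looks at the file name, so it will not detect a renamed or mislabeled jar.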
