
emr-container pyspark job running indefinitely

Here's my Python script:

import calendar
import pydeequ
import boto3
import psycopg2
import os
import pyspark

from py4j import *
from pyspark.sql import SparkSession,Row
from pydeequ.profiles import *
from pydeequ.suggestions import *
from pydeequ.repository import *
from pydeequ.analyzers import *
from botocore.config import Config
from datetime import datetime,timedelta,date
from pyspark.conf import SparkConf
from pydeequ.checks import *
from pydeequ.verification import *
from py4j.java_gateway import java_import

# os.system() streams the version info to stdout; print() then shows its exit code (0)
print(os.system("""pyspark --version"""))

spark = (SparkSession.builder \
        .appName('run_dq_for_xpertrak_pathtrak') \
        .enableHiveSupport() \
        .config(conf=SparkConf()) \
        .config("spark.jars.packages", pydeequ.deequ_maven_coord) \
        .config("spark.jars.excludes", pydeequ.f2j_maven_coord) \
        .getOrCreate())

java_import(spark._sc._jvm, "org.apache.spark.sql.*")

print('here---')
print(spark)

junk = spark.sql("""SELECT * FROM xpertrak.pathtrak LIMIT 10""")

print(junk)

Within AWS emr-containers (i.e. EMR on EKS), this job runs successfully and the UI shows that the job completed. However, when I append the following lines of code to the bottom of the script above, the job technically completes (based on simple log prints), but the UI never changes from the running state...

print('checking')
check = Check(spark, level=CheckLevel.Warning, description="Data Validation Check")
checkResult = VerificationSuite(spark) \
    .onData(junk) \
    .addCheck(
        check.hasSize(lambda x: x >= 5000000)
    ).run()
print(checkResult)
print('check')

This is what that looks like in the AWS console/UI:

[Screenshot: EMR on EKS console showing the job stuck in the "Running" state]

What could be causing this anomaly?

Based on the AWS-supplied docs here, adding the following to the end of the script ended the job successfully:

spark.sparkContext._gateway.shutdown_callback_server()
spark.stop()
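The likely mechanism behind the hang: pydeequ starts a py4j callback server so the JVM can invoke Python lambdas (like the `hasSize` predicate), and that server runs a non-daemon listener thread. A non-daemon thread keeps the Python driver process alive even after the script's last line, so EMR on EKS never sees the driver exit and leaves the job in the running state. The sketch below is a minimal pure-Python analogue of that behavior (it does not use Spark or py4j, and `CallbackServerAnalogue` is a made-up name for illustration):

```python
import threading
import time

class CallbackServerAnalogue:
    """Stand-in for py4j's callback server: a non-daemon listener loop."""

    def __init__(self):
        self._stop = threading.Event()
        # daemon=False mirrors the real callback-server thread: the
        # interpreter will not exit while this thread is still running.
        self._thread = threading.Thread(target=self._serve, daemon=False)

    def start(self):
        self._thread.start()

    def _serve(self):
        while not self._stop.is_set():
            time.sleep(0.05)  # pretend to wait for callbacks from the JVM

    def shutdown(self):
        # The fix from the answer above, in miniature: stop the listener
        # explicitly so the process can terminate.
        self._stop.set()
        self._thread.join()

server = CallbackServerAnalogue()
server.start()
assert server._thread.is_alive()      # script is "done", thread still running
server.shutdown()                      # analogue of shutdown_callback_server()
assert not server._thread.is_alive()   # process can now exit cleanly
```

This is why `shutdown_callback_server()` must run before `spark.stop()`: it stops the lingering listener thread so the driver process can actually terminate, which is what lets EMR on EKS mark the job as completed.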
