emr-containers PySpark job running indefinitely
Here's my Python script:
import calendar
import os
import boto3
import psycopg2
import pydeequ
import pyspark
from botocore.config import Config
from datetime import datetime, timedelta, date
from py4j.java_gateway import java_import
from pydeequ.analyzers import *
from pydeequ.checks import *
from pydeequ.profiles import *
from pydeequ.repository import *
from pydeequ.suggestions import *
from pydeequ.verification import *
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession, Row

print(os.system("pyspark --version"))
spark = (SparkSession.builder \
.appName('run_dq_for_xpertrak_pathtrak') \
.enableHiveSupport() \
.config(conf=SparkConf()) \
.config("spark.jars.packages", pydeequ.deequ_maven_coord) \
.config("spark.jars.excludes", pydeequ.f2j_maven_coord) \
.getOrCreate())
java_import(spark._sc._jvm, "org.apache.spark.sql.*")
print('here---')
print(spark)
junk = spark.sql("""SELECT * FROM xpertrak.pathtrak LIMIT 10""")
print(junk)
Within AWS emr-containers (i.e., EMR on EKS), this job runs successfully, and the UI shows that the job completed. However, when I append the following lines of code to the bottom of the script above, the job technically completes (based on simple log prints), but the UI never changes from the running state...
print('checking')
check = Check(spark, level=CheckLevel.Warning, description="Data Validation Check")
checkResult = VerificationSuite(spark) \
.onData(junk) \
.addCheck(
check.hasSize(lambda x: x >= 5000000)
).run()
print(checkResult)
print('check')
This is what that looks like in the AWS console/UI:
What could be causing this anomaly?
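One plausible explanation (an assumption based on PyDeequ's documented shutdown guidance, not something confirmed in this post): `VerificationSuite` starts a Py4J callback server in the driver, and if that server is left running, the driver never exits, so the job can finish its work yet remain in the RUNNING state. A teardown sketch is below; the helper name `shutdown_pydeequ` is hypothetical, and the `Fake*` classes are test doubles that only exist so the call order can be demonstrated without a real cluster:

```python
def shutdown_pydeequ(spark):
    """Stop PyDeequ's Py4J callback server, then the Spark session.

    If the callback server is not shut down before spark.stop(), the
    driver process can linger and the EMR job may stay in RUNNING
    after the script ends.
    """
    spark.sparkContext._gateway.shutdown_callback_server()
    spark.stop()


# Minimal stand-ins (not Spark APIs) so the teardown order can be
# exercised locally: each call records itself in a shared log.
class FakeGateway:
    def __init__(self, log):
        self.log = log

    def shutdown_callback_server(self):
        self.log.append("callback_server_down")


class FakeContext:
    def __init__(self, log):
        self._gateway = FakeGateway(log)


class FakeSpark:
    def __init__(self):
        self.log = []
        self.sparkContext = FakeContext(self.log)

    def stop(self):
        self.log.append("spark_stopped")


demo = FakeSpark()
shutdown_pydeequ(demo)
print(demo.log)  # callback server comes down before spark.stop()
```

In the real script, the two calls inside `shutdown_pydeequ` would go right after the final `print('check')`, so the callback server is torn down before the session stops.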