
AWS Glue Pyspark, End a job with a condition?

Seems like a simple task, but I'm having trouble finding the docs to see if it's possible. Basically, I have a Glue job that runs every hour and searches a folder to see if data has been uploaded. On some occasions, no data has been uploaded in the past hour, so when the Glue job runs and sees there's no data, I'd like it to terminate. Is that possible? Here's some pseudocode to illustrate what I mean:

def fn(input):
    *fetches list of data*
    return (list of data)

list_of_data = fn(input)
if list_of_data is None:
    Terminate Job

Yes, as bdcloud correctly mentioned, we can trigger the Glue job directly from Lambda. Create an event trigger on the landing folder so that the Glue job is triggered whenever a file is uploaded. Here is a code snippet for AWS Lambda:

import boto3

glue = boto3.client('glue')

def lambda_handler(event, context):
    # Name of the Glue job to start; replace the placeholder.
    gluejobname = "<< THE GLUE JOB NAME >>"

    try:
        # Start the job, then read back the state of the new run.
        runId = glue.start_job_run(JobName=gluejobname)
        status = glue.get_job_run(JobName=gluejobname, RunId=runId['JobRunId'])
        print("Job Status : ", status['JobRun']['JobRunState'])
    except Exception as e:
        print(e)
        print('Error starting Glue job {}. Make sure it exists and is in '
              'the same region as this function.'.format(gluejobname))
        # Re-raise so the Lambda invocation is reported as failed.
        raise

We have had this setup running successfully in a production environment for the last 1.5 years.

Thanks,

Yuva

If your source is S3, then you don't even need to run your Glue job every hour to determine whether there are any uploads or changes to files in the source S3 bucket.

You can leverage an S3 Lambda trigger, which fires whenever there is an upload to S3. Once the Lambda is triggered, it can start your Glue job. Check out this video for more.

This way, your Glue job runs only when there is an upload instead of every hour.
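As a rough sketch of that flow, the handler below reads the bucket and key from the S3 event notification and passes them to the Glue job as job arguments. The job name `my-glue-job` and the argument names `--source_bucket` and `--source_key` are assumptions made for illustration, not something from this thread, and the optional `glue_client` parameter exists only so the handler can be exercised without AWS access.

```python
from urllib.parse import unquote_plus

# Hypothetical job name; replace with your own Glue job.
GLUE_JOB_NAME = "my-glue-job"


def uploaded_objects(event):
    """Extract (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def lambda_handler(event, context, glue_client=None):
    """Start one Glue job run per uploaded object and return the run IDs."""
    if glue_client is None:
        import boto3  # imported lazily so the helpers can be tested offline
        glue_client = boto3.client("glue")

    run_ids = []
    for bucket, key in uploaded_objects(event):
        response = glue_client.start_job_run(
            JobName=GLUE_JOB_NAME,
            # Argument names are illustrative; the Glue script must read them.
            Arguments={"--source_bucket": bucket, "--source_key": key},
        )
        run_ids.append(response["JobRunId"])
    return run_ids
```

Starting one run per object is the simplest choice here; if many files land at once, the job's concurrent-run limit may need raising, or the uploads may need batching.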

If you still want to run your Glue job every hour, then you can use Glue job bookmarks, which make each run process only the data that has arrived since the previous run.
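One way to turn bookmarks on for a run, sketched under the assumption that you start the job through boto3, is to pass the `--job-bookmark-option` job parameter. The helper name `start_bookmarked_run` is made up for this example. Inside the Glue script itself, bookmarks also depend on calling `job.init(...)` at the start, `job.commit()` at the end, and giving each source a `transformation_ctx`.

```python
def start_bookmarked_run(glue_client, job_name):
    """Start a Glue job run with job bookmarks enabled for that run."""
    response = glue_client.start_job_run(
        JobName=job_name,
        # Special Glue job parameter that enables bookmarks.
        Arguments={"--job-bookmark-option": "job-bookmark-enable"},
    )
    return response["JobRunId"]
```

In practice `glue_client` would be `boto3.client("glue")`; the same option can also be set once in the job definition instead of per run.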

The pseudocode you have outlined could work, as I've run similar jobs in the past.
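A minimal sketch of that pattern, assuming the data lives under an S3 prefix and that "new" means modified within the last hour: the function names, bucket, and prefix are hypothetical, and the S3 client is passed in so the logic can be tried without AWS access. Rather than killing the process, the script simply returns early when nothing is found, so the job ends normally.

```python
from datetime import datetime, timedelta, timezone


def list_new_keys(s3_client, bucket, prefix, since):
    """Return keys under bucket/prefix that were modified at or after `since`."""
    keys = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty prefix have no "Contents" entry.
        for obj in page.get("Contents", []):
            if obj["LastModified"] >= since:
                keys.append(obj["Key"])
    return keys


def run(s3_client, bucket, prefix, since, process):
    """Process new keys, or return early when there is nothing to do."""
    keys = list_new_keys(s3_client, bucket, prefix, since)
    if not keys:
        print("No new data; ending the job without processing.")
        return 0
    process(keys)
    return len(keys)


if __name__ == "__main__":
    import boto3  # available in the Glue Python environment

    one_hour_ago = datetime.now(timezone.utc) - timedelta(hours=1)
    # Bucket and prefix are placeholders for illustration only.
    run(boto3.client("s3"), "my-bucket", "landing/", one_hour_ago, print)
```

Here `process` stands in for whatever the job does with the files; in a real Glue script it would wrap the Spark or DynamicFrame logic.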

However, I discovered that Glue jobs used this way are expensive, because you will be charged for the first 10-minute block of usage even if your job ran for less than a minute (which is especially likely when there are no files).

An alternative with lower cost (but more complexity, as you will use S3 events, SQS, and Lambda in concert) is to do the following:

  1. Set up an event notification in S3 that watches for PUT events in the relevant folder and sends a message to SQS (Simple Queue Service).
  2. Set the SQS queue's message retention period to 1 hour (or whatever interval you run the Glue job at). This way, messages stay in the queue for at most 1 hour.
  3. Create a Lambda function that checks the SQS queue for messages (using boto3). Basically, you put the pseudocode you have in Lambda instead of Glue. If there are messages (which means at least 1 file has arrived in that period), trigger the Glue job to process them. If not, do nothing and exit.
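
Step 3 might look roughly like the following sketch. It checks the queue's approximate message count rather than receiving messages, so nothing is consumed and the retention period from step 2 is what clears the queue. The environment variable names `QUEUE_URL` and `GLUE_JOB_NAME` are assumptions for this example, and the optional client parameters exist only so the handler can be tested without AWS access.

```python
import os

# Hypothetical configuration, read from Lambda environment variables.
QUEUE_URL = os.environ.get("QUEUE_URL", "")
GLUE_JOB_NAME = os.environ.get("GLUE_JOB_NAME", "my-glue-job")


def queue_has_messages(sqs_client, queue_url):
    """Return True if the queue reports at least one visible message."""
    response = sqs_client.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    # SQS returns attribute values as strings; the count is approximate.
    return int(response["Attributes"]["ApproximateNumberOfMessages"]) > 0


def lambda_handler(event, context, sqs_client=None, glue_client=None):
    """Start the Glue job only if files arrived; otherwise do nothing."""
    if sqs_client is None or glue_client is None:
        import boto3  # imported lazily so the logic can be tested offline
        sqs_client = sqs_client or boto3.client("sqs")
        glue_client = glue_client or boto3.client("glue")

    if not queue_has_messages(sqs_client, QUEUE_URL):
        return None  # no uploads in the retention window: exit quietly

    response = glue_client.start_job_run(JobName=GLUE_JOB_NAME)
    return response["JobRunId"]
```

The Lambda itself would be invoked on the same hourly schedule the Glue job used to run on, so the Glue job is only billed for hours in which files actually arrived.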

The above method will save you $$.
