
Can AWS Glue process records row wise

I have a requirement to process records from one Redshift cluster to another, row by row. We want to process row-wise because we want to handle failed/invalid records differently. Another benefit is that we avoid reprocessing a whole batch when a single record fails. So I wanted to check whether AWS Glue is suitable for this. If it is not, is there another tool that provides row-level processing?

AWS Glue allows you to implement your own PySpark scripts as part of the transformation process.

PySpark lets you run a function against each row.

There are many ways to do this, for example:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def f_plus_one(x):
    return x + 1

f_udf = udf(f_plus_one, IntegerType())  # register the function as a Spark UDF
df2 = df.withColumn("result", f_udf(df.col1))

This runs f_udf for each row of df and produces df2.
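Since the question also asks about handling failed/invalid records differently, here is a minimal sketch of one way to do that with the same UDF approach: catch failures inside the function, mark the row, and split the DataFrame afterwards. The column name and the TypeError handling are assumptions for illustration, not part of any specific API.

from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

def safe_f(x):
    try:
        return x + 1
    except TypeError:
        return None  # mark the row as invalid instead of failing the batch

safe_udf = udf(safe_f, IntegerType())
df2 = df.withColumn("result", safe_udf(df.col1))
valid_df = df2.filter(col("result").isNotNull())    # continue processing
invalid_df = df2.filter(col("result").isNull())     # handle separately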

AWS Glue-specific documentation on this can be found here:

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-map
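For completeness, a minimal sketch of the DynamicFrame map route that documentation describes, assuming the script runs inside a Glue job; the catalog names ("mydb", "mytable") and the "col1" field are placeholders:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="mydb", table_name="mytable")

def process_record(rec):
    # rec behaves like a dict of the record's fields
    rec["result"] = rec["col1"] + 1
    return rec

mapped = dyf.map(f=process_record)      # runs once per record
errors = mapped.errorsAsDynamicFrame()  # records that failed in the map are
                                        # captured rather than aborting the job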
