How to write string set in dynamo with AWS Glue?

I need to copy data from one DynamoDB table to another and do some transformation along the way. For that, I exported data from the source table to S3 and ran a crawler over it. In my Glue job I'm using the following code:

mapped = apply_mapping.ApplyMapping.apply(
    frame=source_df,
    mappings=[
        ("item.uuid.S", "string", "uuid", "string"),
        ("item.options.SS", "set", "options", "set"),
        ("item.updatedAt.S", "string", "updatedAt", "string"),
        ("item.createdAt.S", "string", "createdAt", "string")
    ],
    transformation_ctx='mapped'
)
df = mapped.toDF()  # convert to Spark DataFrame
# apply some transformation
target_df = DynamicFrame.fromDF(df, glue_context, 'target_df')  # convert back to DynamicFrame
glue_context.write_dynamic_frame_from_options(
    frame=target_df,
    connection_type="dynamodb",
    connection_options={
        "dynamodb.region": "eu-west-1",
        "dynamodb.output.tableName": "my-table",
        "dynamodb.throughput.write.percent": "1.0"
    }
)

In the source DynamoDB table the options field is a String Set. The transformation leaves it untouched. However, in the target table it ends up as a list of strings:

"options": {
    "L": [
      {
        "S": "option A"
      },
      {
        "S": "option B"
      }
    ]
  }
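For comparison, the String Set representation I expect in the target table looks like this:

"options": {
    "SS": [
      "option A",
      "option B"
    ]
  }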

Could anyone advise how to write a string set into DynamoDB using AWS Glue?

Unfortunately, I couldn't find a way to write string sets to DynamoDB using the Glue interfaces. I found some solutions that use boto3 with Spark, so here is mine. I skipped the transformation part and simplified the example in general.

# Load source data from the catalog
source_dyf = glue_context.create_dynamic_frame_from_catalog(
    GLUE_DB, GLUE_TABLE, transformation_ctx="source"
)

# Map dynamo attributes
mapped_dyf = ApplyMapping.apply(
    frame=source_dyf,
    mappings=[
        ("item.uuid.S", "string", "uuid", "string"),
        ("item.options.SS", "set", "options", "set"),
        ("item.updatedAt.S", "string", "updatedAt", "string"),
        ("item.createdAt.S", "string", "createdAt", "string")
    ],
    transformation_ctx='mapped'
)


import boto3

def _putitem(items):
    resource = boto3.resource("dynamodb")
    table = resource.Table("new_table")
    with table.batch_writer() as batch_writer:
        for item in items:
            record = item.asDict()  # Spark Row -> plain dict
            # a Python set is serialized by boto3 as a DynamoDB String Set (SS)
            record["options"] = set(record["options"])
            batch_writer.put_item(Item=record)
    return []  # mapPartitions expects an iterable back


df = mapped_dyf.toDF()
# Apply spark transformations ...

# save partitions to dynamo
df.rdd.mapPartitions(_putitem).collect()
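Since the write is purely a side effect, foreachPartition would also work and avoids the dummy collect():

df.rdd.foreachPartition(_putitem)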

Depending on your data volume, you might want to increase the number of retries in boto3 or even change the retry mechanism. Also, you might want to tune the DynamoDB provisioning. I switched to on-demand capacity to run this particular migration, but there is a catch.
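For example, a minimal sketch of raising the retry budget through botocore's Config (the values here are illustrative, not recommendations):

import boto3
from botocore.config import Config

# illustrative retry settings; tune max_attempts for your volume
config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
resource = boto3.resource("dynamodb", config=config)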

You can try using the ResolveChoice class to convert the datatype.

There are four different actions that a column with an ambiguous type can be resolved with.

Something like this might help:

resolvedMapping = ResolveChoice.apply(mapped, specs=[("item.options.SS", "make_struct")])
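For reference, the four resolution actions are cast, make_cols, make_struct, and project; sketched one-liners below (the column name "options" is assumed):

ResolveChoice.apply(mapped, specs=[("options", "cast:string")])     # force every value to one type
ResolveChoice.apply(mapped, specs=[("options", "make_cols")])       # split each type into its own column
ResolveChoice.apply(mapped, specs=[("options", "make_struct")])     # keep all types inside a struct
ResolveChoice.apply(mapped, specs=[("options", "project:string")])  # keep only values of one type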

You can refer to the link for details:

https://github.com/aws-samples/aws-glue-samples/blob/master/examples/resolve_choice.md
