
AWS Glue Limit data read from S3 Bucket

I have a large bucket that contains more than 6M files. I've run into this error: Failed to sanitize XML document destined for handler class, and I think this is the problem: https://github.com/lbroudoux/es-amazon-s3-river/issues/16

Is there a way I can limit how many files are read in the first runs?

This is what I have: DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = "s3-sat-dth-prd", table_name = "datahub_meraki_user_data", transformation_ctx = "DataSource0"). Can I tell it to read only one folder in my bucket? Every folder within is named like this: partition=13/, partition=14/, partition=n/, and so on.

How can I work around this?

Thanks in advance.

There are three main ways (as far as I know) to avoid this situation.

1. Load from a prefix

To load files from a specific path in AWS Glue, you can use the syntax below.

from awsglue.dynamicframe import DynamicFrame

dynamic_frame = context.create_dynamic_frame_from_options(
        "s3",
        {
            'paths': ['s3://my_bucket_1/my_prefix_1'],
            'recurse': True,                  # also read nested prefixes under the path
            'groupFiles': 'inPartition',      # coalesce small files within each partition
            'groupSize': '1073741824'         # target group size in bytes (~1 GB)
        },
        format='json',
        transformation_ctx='DataSource0'
    )

You can pass multiple entries in paths, and Glue will load from all of them.
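For example, to read only one or two of the partition=n/ folders from the question rather than the whole bucket, a minimal sketch (bucket name and prefixes are hypothetical) could look like this:

dynamic_frame = context.create_dynamic_frame_from_options(
        "s3",
        {
            # list only the partition folders you actually want to read
            'paths': [
                's3://my_bucket_1/partition=13/',
                's3://my_bucket_1/partition=14/'
            ],
            'recurse': True
        },
        format='json',
        transformation_ctx='DataSource0'
    )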

2. Use Glue bookmarks

When you have millions of files in a bucket and you only want to load the new files (between runs of your Glue job), you can enable Glue bookmarks. Glue will keep track of the files it has already read in an internal index (which we don't have access to). You can pass this as a parameter when you define the job.


  MyJob:
    Type: AWS::Glue::Job
    Properties:
      ...
      GlueVersion: 2.0
      Command:
        Name: glueetl
        PythonVersion: 3
        ...
      DefaultArguments: {
        "--job-bookmark-option": "job-bookmark-enable",
        ...
      }

This will enable bookmarks, which are identified by the name you pass as transformation_ctx when you load data. Yes, it's confusing that AWS uses the same parameter for multiple purposes!

It's also important not to forget to add job.commit() at the end of your Glue script, where job is your from awsglue.job import Job instance.

Then, when you use the same context.create_dynamic_frame_from_options() function with your root prefix and the same transformation_ctx, it will only load the new files under that prefix. This saves a lot of hassle in looking for new files. Read the docs for more information on bookmarks.
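Putting the bookmark pieces together, a minimal script skeleton (job name and prefix are hypothetical) could look like this; the important parts are calling job.init() with the job arguments, reusing the same transformation_ctx, and calling job.commit() at the end so the bookmark state is saved:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glue_context = GlueContext(SparkContext.getOrCreate())

# init() ties this run to the bookmark state stored for the job
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

# same transformation_ctx as before, so only files not seen in earlier runs are read
dynamic_frame = glue_context.create_dynamic_frame_from_options(
        "s3",
        {'paths': ['s3://my_bucket_1/'], 'recurse': True},
        format='json',
        transformation_ctx='DataSource0'
    )

# ... transformations and writes go here ...

# commit() persists the bookmark so the next run skips the files read above
job.commit()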

3. Avoid small file sizes

AWS Glue will take ages to load files if you have a lot of very small files. So, if you can control the file size, make the files at least 100 MB. For instance, we were writing to S3 from a Firehose stream, and we could adjust the buffer size to avoid producing small files. This drastically reduced the loading times for our Glue job.
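As an illustration of that Firehose tuning (the stream name, role, and bucket below are hypothetical), the buffering hints can be set when creating the delivery stream, for example with boto3:

import boto3

firehose = boto3.client('firehose')

# larger buffers mean fewer, bigger objects landing in S3
firehose.create_delivery_stream(
    DeliveryStreamName='my-user-data-stream',
    ExtendedS3DestinationConfiguration={
        'RoleARN': 'arn:aws:iam::123456789012:role/firehose-delivery-role',
        'BucketARN': 'arn:aws:s3:::my-bucket-1',
        'BufferingHints': {
            'SizeInMBs': 128,          # flush when ~128 MB is buffered (the service maximum)
            'IntervalInSeconds': 900   # or after 15 minutes, whichever comes first
        }
    }
)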

I hope these tips help. Feel free to ask if you need further clarification.

There is a way to control the number of files read, called bounded execution. It's documented here: https://docs.aws.amazon.com/glue/latest/dg/bounded-execution.html

In the following examples you would be loading 200 files at a time. Note that you must enable Glue bookmarks for this to work correctly.

If you are using from_options, it looks like this:

    DataSource0 = glueContext.create_dynamic_frame.from_options(
        format_options={"withHeader": True, "separator": separator, "quoteChar": quoteChar},
        connection_type="s3",
        format="csv",
        connection_options={"paths": inputFilePath,
                            "boundedFiles": "200", "recurse": True},
        transformation_ctx="DataSource0"
    )

If you are using from_catalog, it looks like this:

    DataSource0 = glueContext.create_dynamic_frame.from_catalog(
        database = "database-name",
        table_name= "table-name",
        additional_options={"boundedFiles": "200"},
        transformation_ctx="DataSource0"
    )
