
Copy and Merge files to another S3 bucket

I have a source bucket where small 5KB JSON files will be inserted every second. I want to use AWS Athena to query the files by using an AWS Glue data source and crawler. For better query performance, AWS Athena recommends larger file sizes.

So I want to copy the files from the source bucket to bucket2 and merge them.

I am planning to use S3 events to put a message in AWS SQS for each file created. A Lambda will then be invoked with a batch of x SQS messages, read the data in those files, combine them and save the result to the destination bucket. bucket2 will then be the source of the AWS Glue crawler.
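
To make the plan concrete, here is a minimal sketch of what such a handler could look like in Python with boto3. It assumes standard S3 event notifications delivered through SQS and a hypothetical DEST_BUCKET environment variable pointing at bucket2; it concatenates the small objects as JSON Lines, which Glue and Athena read well.

```python
import json
import os
import uuid
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")

# Hypothetical destination bucket name; replace with your bucket2.
DEST_BUCKET = os.environ.get("DEST_BUCKET", "bucket2")

def handler(event, context):
    """Triggered by SQS with a batch of S3 'ObjectCreated' notifications.

    Reads each small JSON file, concatenates them as JSON Lines
    (one object per line) and writes a single larger object to the
    destination bucket.
    """
    lines = []
    for record in event["Records"]:                # one SQS message per file
        s3_event = json.loads(record["body"])      # S3 notification inside the SQS body
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            # Normalise each file to a single line so the merged file is valid JSON Lines.
            lines.append(json.dumps(json.loads(body)))

    if lines:
        merged_key = f"merged/{uuid.uuid4()}.json"
        s3.put_object(Bucket=DEST_BUCKET, Key=merged_key,
                      Body=("\n".join(lines) + "\n").encode("utf-8"))
    return {"merged_files": len(lines)}
```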

Will this be the best approach or am I missing something?

Instead of receiving a 5KB JSON file every second in Amazon S3, the best situation would be to receive this data via Amazon Kinesis Data Firehose, which can automatically combine data based on either size or time period. It would output fewer, larger files.

You could also achieve this with a slight change to your current setup:

  • When a file is uploaded to S3, trigger an AWS Lambda function
  • The Lambda function reads the file and sends it to Amazon Kinesis Data Firehose
  • Kinesis Firehose then batches the data by size or time (see the sketch after this list)
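
A minimal sketch of the forwarding Lambda might look like this (Python/boto3). The STREAM_NAME delivery stream is a hypothetical name; the Firehose stream itself would be configured separately with your desired buffering size/interval and bucket2 as its destination.

```python
import os
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")
firehose = boto3.client("firehose")

# Hypothetical delivery stream; Firehose buffers by size/time before writing to S3.
STREAM_NAME = os.environ.get("STREAM_NAME", "json-merge-stream")

def handler(event, context):
    """Triggered directly by the S3 ObjectCreated event.

    Forwards the file contents to Kinesis Data Firehose, which batches
    records before delivering larger objects to the destination bucket.
    """
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        data = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # Firehose concatenates records as-is, so terminate each with a newline.
        firehose.put_record(
            DeliveryStreamName=STREAM_NAME,
            Record={"Data": data.rstrip(b"\n") + b"\n"},
        )
```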

Alternatively, you could use Amazon Athena to read data from multiple S3 objects and output them into a new table that uses Snappy-compressed Parquet files. This file format is very efficient for querying. However, your issue is that the files are arriving every second, so it is difficult to query the incoming files in batches (so that you know which files have been loaded and which have not). A kludge could be a script that does the following:

  • Create an external table in Athena that points to a batching directory (eg batch/)
  • Create an external table in Athena that points to the final data (eg final/)
  • Have incoming files come into incoming/
  • At regular intervals, trigger a Lambda function that will list the objects in incoming/, copy them to batch/ and delete those source objects from incoming/ (any objects that arrive during this copy process will be left for the next batch; see the sketch after this list)
  • In Athena, run INSERT INTO final SELECT * FROM batch
  • Delete the contents of the batch/ directory
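
A rough sketch of such a script, run as a scheduled Lambda (Python/boto3), could look like the following. The bucket, database and output-location names are placeholders, and pagination and error handling are omitted for brevity.

```python
import time

import boto3

s3 = boto3.client("s3")
athena = boto3.client("athena")

# Hypothetical names used for illustration only.
BUCKET = "bucket2"
DATABASE = "mydb"
ATHENA_OUTPUT = "s3://bucket2/athena-results/"

def handler(event, context):
    """Runs on a schedule (e.g. EventBridge every few minutes)."""
    # 1. Snapshot incoming/ and move it to batch/ (late arrivals wait for the next run).
    listed = s3.list_objects_v2(Bucket=BUCKET, Prefix="incoming/")
    keys = [o["Key"] for o in listed.get("Contents", [])]
    for key in keys:
        s3.copy_object(Bucket=BUCKET,
                       Key=key.replace("incoming/", "batch/", 1),
                       CopySource={"Bucket": BUCKET, "Key": key})
        s3.delete_object(Bucket=BUCKET, Key=key)
    if not keys:
        return

    # 2. Append the batch to the final table (Athena rewrites it in the table's format).
    qid = athena.start_query_execution(
        QueryString="INSERT INTO final SELECT * FROM batch",
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": ATHENA_OUTPUT},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(
            QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    # 3. Clear batch/ only if the INSERT succeeded.
    if state == "SUCCEEDED":
        for key in keys:
            s3.delete_object(Bucket=BUCKET,
                             Key=key.replace("incoming/", "batch/", 1))
```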

This will append the data into the final table in Athena, in a format that is good for querying.

However, the Kinesis Firehose option is simpler, even if you need to trigger a Lambda to send the files to the Firehose.

This is what I think will be simpler:

  1. Have an input folder input/ where the 5KB/1KB files land; a data/ folder will hold the merged JSON files, each with a maximum size of 200MB.
  2. Have a Lambda that runs every 1 minute, reads a set of files from input/ and appends them to the last file in the data/ folder, using Golang/Java.
  3. The Lambda (with max concurrency of 1) copies a set of 5KB files from input/ and the current XMB file from the data/ folder into its /tmp folder, merges them, uploads the merged file to data/ and deletes the processed files from the input/ folder (a sketch follows this list).
  4. Whenever the file size crosses 200MB, create a new file in the data/ folder.
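
A simplified sketch of steps 2-4 (Python/boto3 rather than Golang/Java, merging in memory instead of staging in /tmp for brevity) might look like this. The bucket name and file-naming scheme are assumptions.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/prefix names from the steps above.
BUCKET = "my-bucket"
INPUT_PREFIX = "input/"
DATA_PREFIX = "data/"
MAX_SIZE = 200 * 1024 * 1024  # roll over to a new data/ file past ~200MB

def handler(event, context):
    """Scheduled every minute; run with reserved concurrency = 1."""
    # Collect the small files currently sitting in input/.
    small = s3.list_objects_v2(Bucket=BUCKET, Prefix=INPUT_PREFIX).get("Contents", [])
    if not small:
        return

    # Find the newest file in data/, if any, and check whether it still has room.
    data_objs = s3.list_objects_v2(Bucket=BUCKET, Prefix=DATA_PREFIX).get("Contents", [])
    current = max(data_objs, key=lambda o: o["Key"]) if data_objs else None

    merged = b""
    if current and current["Size"] < MAX_SIZE:
        merged = s3.get_object(Bucket=BUCKET, Key=current["Key"])["Body"].read()
        target_key = current["Key"]
    else:
        # Size limit crossed (or no file yet): start a new data/ file.
        target_key = f"{DATA_PREFIX}part-{len(data_objs):05d}.json"

    # Append each small JSON file, upload the result, then delete the processed inputs.
    for obj in small:
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        merged += body.rstrip(b"\n") + b"\n"
    s3.put_object(Bucket=BUCKET, Key=target_key, Body=merged)
    for obj in small:
        s3.delete_object(Bucket=BUCKET, Key=obj["Key"])
```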

The advantage here is that at any instant, if somebody wants the data, it is simply the union of the input/ and data/ folders.

In other words, with a few tweaks here and there you can expose a view on top of the input/ and data/ folders that presents a final, de-duplicated snapshot of the data.
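
For example, assuming two external tables input_table and data_table defined over input/ and data/ (the names here are hypothetical), a de-duplicated view could be created through Athena like this:

```python
import boto3

athena = boto3.client("athena")

# UNION (without ALL) removes duplicate rows across the two tables,
# giving a de-duplicated snapshot of input/ plus data/.
VIEW_SQL = """
CREATE OR REPLACE VIEW current_snapshot AS
SELECT * FROM data_table
UNION
SELECT * FROM input_table
"""

athena.start_query_execution(
    QueryString=VIEW_SQL,
    QueryExecutionContext={"Database": "mydb"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```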

