Write to a specific folder in S3 bucket using AWS Kinesis Firehose

I would like to be able to route data sent to Kinesis Firehose based on the content inside the data. For example, if I sent this JSON data:

{
   "name": "John",
   "id": 345
}

I would like to filter the data based on id and send it to a subfolder of my S3 bucket, like: S3://myS3Bucket/345_2018_03_05. Is this at all possible with Kinesis Firehose or AWS Lambda?

The only way I can think of right now is to resort to creating a Kinesis stream for every single one of my possible IDs, pointing them all at the same bucket, and then sending my events to those streams from my application, but I would like to avoid that since there are many possible IDs.

You probably want to use an S3 event notification that gets fired each time Firehose places a new file in your S3 bucket (a PUT); the S3 event notification should call a custom Lambda function that you write, which reads the contents of the S3 file, splits it up, and writes it out to the separate buckets, keeping in mind that each S3 file is likely to contain many records, not just one.

https://aws.amazon.com/blogs/aws/s3-event-notification/
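To make that concrete, here is a minimal sketch of such a notification-driven Lambda in Python. It assumes the Firehose output is newline-delimited JSON, and it borrows the myS3Bucket name and the <id>_<YYYY_MM_DD> prefix from the question; adjust both to your own layout.

    import json
    import urllib.parse
    from datetime import datetime, timezone

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical destination bucket taken from the question; replace with your own.
    DEST_BUCKET = "myS3Bucket"

    def lambda_handler(event, context):
        """Triggered by an S3 PUT notification on the Firehose delivery bucket.

        Reads the newly delivered object, splits it into individual JSON
        records, and rewrites each record under an id-based prefix such as
        345_2018_03_05/.
        """
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

            # Firehose concatenates records into one object; assume newline-delimited JSON here.
            for i, line in enumerate(filter(None, body.splitlines())):
                item = json.loads(line)
                date_part = datetime.now(timezone.utc).strftime("%Y_%m_%d")
                prefix = f"{item['id']}_{date_part}"

                s3.put_object(
                    Bucket=DEST_BUCKET,
                    Key=f"{prefix}/{context.aws_request_id}_{i}.json",
                    Body=line.encode("utf-8"),
                )

One thing to keep in mind with this approach is that it writes one small object per record, so for high-volume streams you would probably want to group records per id and write them in batches instead.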

This is not possible out-of-the-box, but here are some ideas...

You can write a Data Transformation in Lambda that is triggered by Amazon Kinesis Firehose for every record. You could code the Lambda to save the data to a specific file in S3, rather than having Firehose do it. However, you'd miss out on the record aggregation features of Firehose.
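A rough sketch of that kind of transformation Lambda in Python is below. The records/recordId/result/data shape is the standard Firehose transformation contract; the bucket name and id-based key layout are assumptions carried over from the question. Note that returning each record with result "Ok" means Firehose still delivers its own copy, which is the duplication issue raised in the comment further down.

    import base64
    import json
    from datetime import datetime, timezone

    import boto3

    s3 = boto3.client("s3")
    DEST_BUCKET = "myS3Bucket"  # hypothetical bucket name from the question

    def lambda_handler(event, context):
        """Firehose data-transformation Lambda that writes each record to an
        id-based S3 prefix itself, instead of relying on Firehose's delivery layout."""
        output = []
        for record in event["records"]:
            payload = base64.b64decode(record["data"])
            item = json.loads(payload)

            date_part = datetime.now(timezone.utc).strftime("%Y_%m_%d")
            key = f"{item['id']}_{date_part}/{record['recordId']}.json"
            s3.put_object(Bucket=DEST_BUCKET, Key=key, Body=payload)

            # Returning the record unchanged means Firehose will also deliver it,
            # so the data ends up in S3 twice unless Firehose delivery is discarded.
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": record["data"],
            })

        return {"records": output}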

You could use Amazon Kinesis Analytics to look at each record and send the data to a different output stream based on its content. For example, you could have a separate Firehose stream per delivery channel, with Kinesis Analytics queries choosing the destination.

If you use a Lambda to save the data, you would end up with duplicate data in S3: one copy stored by the Lambda and another stored by Firehose, since the transformation Lambda will add the data back to Firehose. Unless there is a way to avoid the transformed data from the Lambda being re-added to the stream; I am not aware of a way to avoid that.
