
How do I safely remove buggy documents from a Kinesis Firehose S3 bucket?

I've set up a Kinesis Firehose for others to send me data on, and noticed that the data is occasionally malformed. The malformed docs fail to ETL properly into Redshift. They end up being left in the intermediary Firehose S3 bucket, where they keep generating spammy error messages referencing the STL_LOAD_ERRORS table.

Is there a safe way to remove the problematic records from the S3 bucket? Or any other safe way to clean up the malformed records?

--

Note that I've already tried simply deleting the malformed records from S3. This seems to put Kinesis Firehose into an infinite loop, generating error spam with the message: "One or more S3 files required by Redshift have been removed from the S3 bucket". As far as I can tell, this spam is supposed to stop eventually, but in my experiments it keeps going without a break.

Here is what will work.

  1. The STL_LOAD_ERRORS table will give you the filename in S3, along with the line number and the reason for the error.
  2. Find the erroneous record, correct it, and re-stream it from the source via Firehose.
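
As a rough sketch of the two steps above (not a definitive implementation): the query pulls the filename, line number, and error reason from STL_LOAD_ERRORS, and the helper pushes one corrected record back through Firehose using boto3's `put_record`. The stream name, the example record, and the assumption that records are newline-delimited JSON are all hypothetical; adjust them to match your own pipeline.

```python
import json

# Step 1: run this against Redshift with any SQL client to locate bad rows.
# filename, line_number, err_reason and raw_line are STL_LOAD_ERRORS columns.
LOAD_ERRORS_SQL = """
SELECT starttime, filename, line_number, colname, err_reason, raw_line
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 20;
"""


def resend_corrected(firehose_client, stream_name, record):
    """Step 2: send one corrected record back through the delivery stream.

    Assumes the stream carries newline-delimited JSON (an assumption,
    not something stated in the question).
    """
    data = (json.dumps(record) + "\n").encode("utf-8")
    return firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": data},
    )


# Example usage (hypothetical stream name and record):
#   import boto3
#   client = boto3.client("firehose")
#   resend_corrected(client, "my-delivery-stream", {"id": 42, "value": "fixed"})
```

The client is passed in as an argument so the helper can be tried against a stub before pointing it at a real delivery stream.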


 