
Purge old data in DynamoDB by rate limiting in PHP

I have a dataset in DynamoDB whose primary key is the user ID, and a timestamp is one of the data attributes. I want to run a purge query on this table that deletes every item whose timestamp is older than 1 week.

I do not want to eat up all of the write capacity units per second. Ideally I want a rate-limited delete operation (in PHP); otherwise, for a dataset tens of GB in size, the purge will block other writes.

I was wondering whether a global secondary index on timestamp (+ user ID) would help reduce the rows to be scanned. But again, I don't want to thrash the table so badly that other writes start failing.

Can someone provide rate-limited insert/delete example code and references for this in PHP?

You can create a global secondary index:

timestampHash (number, between 1 and 100)
timestamp (number)

Whenever you create/update your timestamp, also set the timestampHash attribute to a random number between 1 and 100. This will distribute the items in your index evenly. You need this hash because to do a range query on a GSI, you need a hash key. Querying by user ID and timestamp doesn't make sense, because that would only return one item each time and you would have to loop over all your users (assuming there is one item per user ID).
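As a minimal sketch of the write side, assuming the AWS SDK for PHP v3 and a hypothetical table named events with a userId string key (adjust the names to your schema):

require 'vendor/autoload.php';

use Aws\DynamoDb\DynamoDbClient;

$client = new DynamoDbClient([
    'region'  => 'us-east-1',
    'version' => '2012-08-10',
]);

// Set timestampHash to a random value in 1-100 so items spread evenly
// across the GSI's partitions.
$client->putItem([
    'TableName' => 'events',
    'Item' => [
        'userId'        => ['S' => 'user-123'],
        'timestamp'     => ['N' => (string) time()],
        'timestampHash' => ['N' => (string) mt_rand(1, 100)],
    ],
]);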

Then you can run a purger that does a query for each of the 100 timestampHash values, fetching all items with a timestamp older than 1 week. Between runs you can wait 5 minutes, or however long you think is appropriate, depending on the number of items you need to purge.

You can use BatchWriteItem to leverage the API's multithreading to delete concurrently.

In pseudocode it looks like this:

while (true) {
    for (int i = 1; i <= 100; i++) {
        records = dynamo.query(timestampHash = i, timestamp < Date.now() - 1 week);
        dynamo.batchWriteItem(records, DELETE);
    }
    sleep(5 minutes);
}
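A more concrete PHP sketch of that loop, again assuming SDK v3, the hypothetical events table keyed by userId, and a GSI named timestamp-index whose keys are timestampHash and timestamp (timestamp is a DynamoDB reserved word, hence the expression attribute name):

$cutoff = time() - 7 * 24 * 60 * 60;  // older than 1 week

while (true) {
    for ($hash = 1; $hash <= 100; $hash++) {
        $result = $client->query([
            'TableName'                 => 'events',
            'IndexName'                 => 'timestamp-index',
            'KeyConditionExpression'    => 'timestampHash = :h AND #ts < :cutoff',
            'ExpressionAttributeNames'  => ['#ts' => 'timestamp'],
            'ExpressionAttributeValues' => [
                ':h'      => ['N' => (string) $hash],
                ':cutoff' => ['N' => (string) $cutoff],
            ],
        ]);

        // Delete by the base table's primary key (which the GSI projects);
        // BatchWriteItem accepts at most 25 requests per call.
        foreach (array_chunk($result['Items'], 25) as $chunk) {
            $requests = [];
            foreach ($chunk as $item) {
                $requests[] = ['DeleteRequest' => ['Key' => ['userId' => $item['userId']]]];
            }
            $client->batchWriteItem(['RequestItems' => ['events' => $requests]]);
        }
    }
    sleep(300);  // wait 5 minutes between sweeps to spare write capacity
}

Note this sketch skips query pagination (LastEvaluatedKey) and BatchWriteItem's UnprocessedItems; a production purger should handle both, and the sleep between sweeps is what keeps write consumption bounded.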

You can also catch ProvisionedThroughputExceededException and do an exponential back-off, so that if you do exceed the provisioned throughput you stop and wait until it recovers.
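A minimal back-off wrapper around the batch delete, assuming SDK v3 where throttling surfaces as an Aws\DynamoDb\Exception\DynamoDbException carrying that error code:

use Aws\DynamoDb\Exception\DynamoDbException;

function batchDeleteWithBackoff(DynamoDbClient $client, array $params, int $maxRetries = 5)
{
    for ($attempt = 0; $attempt <= $maxRetries; $attempt++) {
        try {
            return $client->batchWriteItem($params);
        } catch (DynamoDbException $e) {
            if ($e->getAwsErrorCode() !== 'ProvisionedThroughputExceededException') {
                throw $e;  // back off only on throttling
            }
            sleep(2 ** $attempt);  // exponential back-off: 1s, 2s, 4s, ...
        }
    }
    throw new RuntimeException('Throughput still exceeded after retries');
}

(The SDK also retries throttled requests a few times by default; this wrapper just makes the back-off explicit and longer.)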


Another way is to structure your tables by time.

TABLE_08292016
TABLE_09052016
TABLE_09122016

All your data for the week of 08/28/2016 goes into TABLE_08292016. Then at the end of every week you can simply drop the oldest table.
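A small sketch of that routing, assuming tables are named after the Monday of each week as the names above suggest:

// Route writes to the current week's table, e.g. TABLE_08292016.
$monday    = new DateTime('monday this week');
$tableName = 'TABLE_' . $monday->format('mdY');

// A weekly cron job then drops the table that has aged past retention:
$client->deleteTable(['TableName' => 'TABLE_08292016']);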
