How to delete entities not found in feed on GAE

I am updating and adding items from a feed (which can have about 40000 items) to the datastore, 200 items at a time. The problem is that the feed can change, and some items might be deleted from the feed. I have this code:

class FeedEntry(db.Model):
    name = db.StringProperty(required=True)

def updateFeed(offset, number=200):
    response = fetchFeed(offset, number)
    feedItems = parseFeed(response)
    feedEntriesToAdd = []
    for item in feedItems:
        feedEntriesToAdd.append(
            FeedEntry(key_name=item.id, name=item.name)
        )
    db.put(feedEntriesToAdd)

How do I find out which items were not in the feed and delete them from the datastore? I thought about creating a list of the items in the datastore and removing from it every item that I updated, so that the ones left are the ones to delete, but that seems rather slow.
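
For reference, the set-difference idea described above can be sketched in plain Python. This is only an illustration with no App Engine calls: `stored_ids` and `feed_ids` are hypothetical names standing in for the key names already in the datastore and the ids seen in the feed.

```python
def find_stale_ids(stored_ids, feed_ids):
    """Return the ids that are stored but no longer appear in the feed."""
    return set(stored_ids) - set(feed_ids)


stored_ids = ["a", "b", "c", "d"]   # hypothetical key names in the datastore
feed_ids = ["b", "d", "e"]          # hypothetical ids seen in the latest feed
stale = find_stale_ids(stored_ids, feed_ids)
print(sorted(stale))  # ['a', 'c']
```

The difference itself is cheap; the cost lies in loading every stored key name first, which is the part that seems slow with about 40000 entities.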

PS: Every item.id is unique to its feed item and is consistent.

If you add a DateTimeProperty with auto_now=True, it will record the last modified time of each entity. Since you update every item in the feed, by the time you've finished they will all have times after the moment you started, so anything with a date before then isn't in the feed any more.
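
A minimal in-memory sketch of that timestamp idea, assuming a plain dict stands in for the datastore and the `updated` attribute plays the role of the auto_now property. The names `Entry`, `refresh` and `delete_stale` are made up for illustration and are not part of the App Engine API.

```python
from datetime import datetime, timedelta


class Entry(object):
    """Stand-in for a stored entity with a last-modified timestamp."""

    def __init__(self, name, updated):
        self.name = name
        self.updated = updated


def refresh(store, feed_items, now):
    """Write every feed item, stamping it with the current time."""
    for item_id, name in feed_items:
        store[item_id] = Entry(name, now)


def delete_stale(store, refresh_started):
    """Remove entries last written before the refresh began."""
    stale = [key for key, entry in store.items()
             if entry.updated < refresh_started]
    for key in stale:
        del store[key]
    return stale


store = {}
t0 = datetime(2024, 1, 1, 12, 0, 0)
refresh(store, [("1", "one"), ("2", "two"), ("3", "three")], t0)

# Next refresh starts an hour later; item "2" has left the feed.
t1 = t0 + timedelta(hours=1)
refresh(store, [("1", "one"), ("3", "three v2")], t1 + timedelta(seconds=5))
removed = delete_stale(store, t1)
print(removed)  # ['2']
```

The only value that has to be remembered across the refresh is the moment it started (`t1` here).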

Xavier's generation counter is just as good: all we need is something guaranteed to increase between refreshes and never decrease during a refresh.

I'm not sure from the docs, but I expect a DateTimeProperty is bigger than an IntegerProperty. The latter is a 64-bit integer, so they might be the same size, or it may be that DateTimeProperty stores several integers. A group post suggests it may be 10 bytes as opposed to 8.

But remember that by adding an extra property that you do queries on, you're adding another index anyway, so the difference in size of the field is diluted as a proportion of the overhead. Further, 40k times a few bytes isn't much even at $0.24/GB/month.
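
A back-of-the-envelope check of that cost, assuming 10 extra bytes per entity (the larger of the two sizes mentioned above) and ignoring index overhead. The constant names are made up for illustration.

```python
ENTITIES = 40000
EXTRA_BYTES_PER_ENTITY = 10      # assumption: upper estimate from the text
PRICE_PER_GB_MONTH = 0.24        # dollars, as quoted above
BYTES_PER_GB = float(1024 ** 3)

extra_bytes = ENTITIES * EXTRA_BYTES_PER_ENTITY
monthly_cost = extra_bytes / BYTES_PER_GB * PRICE_PER_GB_MONTH

print("%d extra bytes" % extra_bytes)          # 400000 extra bytes
print("about $%.6f per month" % monthly_cost)
```

Under these assumptions the extra property costs a small fraction of a cent per month.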

With either a generation or a datetime, you don't necessarily have to delete the data immediately: your other queries could filter on the date or generation of the most recent refresh. If the feed (or your parsing of it) misbehaves and fails to produce any items, or only produces a few, it might be useful to have the last refresh lying around as a backup. Whether that is worth having depends entirely on the app.
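
A sketch of that "filter instead of delete" idea, again using a plain dict as a stand-in for the datastore. The helpers `current_entries` and `safe_to_purge`, and the `min_items` threshold, are assumptions added for illustration.

```python
def current_entries(store, latest_generation):
    """Return only the entries written by the most recent refresh."""
    return dict((key, value) for key, value in store.items()
                if value["generation"] == latest_generation)


def safe_to_purge(store, latest_generation, min_items):
    """Purge older data only if the latest refresh looks healthy."""
    return len(current_entries(store, latest_generation)) >= min_items


store = {
    "1": {"name": "one", "generation": 2},
    "2": {"name": "two", "generation": 1},    # left over from the last refresh
    "3": {"name": "three", "generation": 2},
}

print(sorted(current_entries(store, 2)))      # ['1', '3']
print(safe_to_purge(store, 2, min_items=2))   # True
print(safe_to_purge(store, 3, min_items=2))   # False
```

The last call models a refresh (generation 3) that produced no items: the older data stays available until a refresh yields a plausible number of entries.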

I would add a generation counter:

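from google.appengine.ext import db

class FeedEntry(db.Model):
    name = db.StringProperty(required=True)
    generation = db.IntegerProperty(required=True)

def updateFeed(offset, generation, number=200):
    response = fetchFeed(offset, number)
    feedItems = parseFeed(response)
    feedEntriesToAdd = []
    for item in feedItems:
        # Stamp every entity with the generation of the current refresh.
        feedEntriesToAdd.append(
            FeedEntry(key_name=item.id, name=item.name, generation=generation)
        )
    db.put(feedEntriesToAdd)

def deleteOld(generation):
    # Entities from any other generation were not in the latest feed.
    q = db.GqlQuery("SELECT * FROM FeedEntry " +
            "WHERE generation != :1", generation)
    # Delete the matching entities, up to 200 per call; call again
    # until the query returns nothing.
    db.delete(q.fetch(200))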
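
With up to 40000 entities, removing the old generation in a single call may not be practical. Below is an in-memory sketch of deleting in batches, assuming a dict that maps key names to generation numbers stands in for the datastore. `delete_old_batch` and `delete_all_old` are hypothetical helpers, not App Engine functions.

```python
def delete_old_batch(store, generation, batch_size=200):
    """Delete up to batch_size entries from other generations."""
    stale = [key for key, gen in store.items() if gen != generation]
    stale = stale[:batch_size]
    for key in stale:
        del store[key]
    return len(stale)


def delete_all_old(store, generation, batch_size=200):
    """Repeat batch deletes until nothing stale is left."""
    total = 0
    while True:
        deleted = delete_old_batch(store, generation, batch_size)
        if deleted == 0:
            return total
        total += deleted


# 1000 entries: every fourth one belongs to the current generation (2).
store = dict((str(i), 2 if i % 4 == 0 else 1) for i in range(1000))
deleted = delete_all_old(store, generation=2)
print("%d deleted, %d kept" % (deleted, len(store)))  # 750 deleted, 250 kept
```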
