App Engine: Best way to check for updates to data in the datastore, while avoiding datastore writes
I have a large number of entities (products) in my datastore which come from an external data source. I want to check them for updates daily.

Some items are already up to date because the application fetched them directly. Some are newly inserted and don't need updates.

For the ones which have not been fetched I have cron jobs running. I use the Python API.
At the moment I do the following.

I have a field
dateupdated = db.DateTimeProperty(auto_now_add=True)
I can then use
query = dbmodel.product.all()
query.filter('dateupdated <', newdate)
query.order('dateupdated')
results = query.fetch(limit=mylimit, offset=myoffset)
to pick the oldest entries and schedule them for update. I use the Task Queue with custom task names to make sure each product update is only run once a day.
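For reference, the once-a-day guarantee via custom task names can be sketched like this. The helper name `daily_task_name` and the task URL are illustrative, not from the original; the key idea is that App Engine tombstones task names, so re-adding the same name on the same day raises an error instead of enqueueing a duplicate:

```python
import datetime

def daily_task_name(product_key, day=None):
    """Build a deterministic, per-day task name, e.g.
    'update-abc123-20110501', so the same product can only be
    scheduled once per day."""
    day = day or datetime.date.today()
    return 'update-%s-%s' % (product_key, day.strftime('%Y%m%d'))

# On App Engine you would then enqueue with something like:
#
#   from google.appengine.api import taskqueue
#   try:
#       taskqueue.add(name=daily_task_name(key),
#                     url='/tasks/update_product',
#                     params={'key': key})
#   except (taskqueue.TaskAlreadyExistsError,
#           taskqueue.TombstonedTaskError):
#       pass  # already scheduled today
```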
The problem is that I need to update the field dateupdated, which means a datastore write even if a product's data has not changed, just to keep track of the update process.

This consumes lots of resources (CPU hours, Datastore API calls, etc.).

Is there a better way to perform such a task and avoid the unnecessary datastore writes?
By ordering a query by dateupdated and then storing a cursor after you have processed your entities, you can re-run the same query later to get only the items updated after your last query.
So, given a class like
class MyEntity(db.Model):
    dateupdated = db.DateTimeProperty(auto_now_add=True)
You could set up a handler to be run as a task like:
from google.appengine.api import taskqueue
from google.appengine.ext import webapp

class ProcessNewEntities(webapp.RequestHandler):
    def get(self):
        """Run via a task to process batches of 'batch_size'
        recently updated entities."""
        # number of entities to process per task execution
        batch_size = 100
        # build the basic query
        q = MyEntity.all().order("dateupdated")
        # use a cursor?
        cursor = self.request.get("cursor")
        if cursor:
            q.with_cursor(cursor)
        # fetch the batch
        entities = q.fetch(batch_size)
        for entity in entities:
            # process the entity
            do_your_processing(entity)
        # queue up the next task to process the next 100;
        # if we have no more to process then delay this task
        # for a while so that it doesn't hog the application
        delay = 600 if len(entities) < batch_size else 0
        taskqueue.add(
            url='/tasks/process_new_entities',
            params={'cursor': q.cursor()},
            countdown=delay)
and then you just need to trigger the start of the task execution like:
def start_processing_entities():
    taskqueue.add(url='/tasks/process_new_entities')
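Since the question mentions cron jobs, one way to kick this off daily is a cron.yaml entry. This is a sketch; it assumes you map a URL (the path below is illustrative) to a handler that calls start_processing_entities:

```yaml
cron:
- description: kick off processing of recently updated entities
  url: /tasks/start_processing_entities
  schedule: every 24 hours
```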