Best way to aggregate data from the NDB datastore?

Question

I have a StatisticStore model defined as:

from google.appengine.ext import ndb

class StatisticStore(ndb.Model):
  user = ndb.KeyProperty(kind=User)
  created = ndb.DateTimeProperty(auto_now_add=True)
  kind = ndb.StringProperty()
  properties = ndb.PickleProperty()

  @classmethod
  def top_links(cls, user, start_date, end_date):
    '''
    returns the user's top links for the given date range
    e.g.
    {'http://stackoverflow.com': 30,
     'http://google.com': 10,
     'http://yahoo.com': 15}
    '''
    stats = cls.query(
      cls.user == user.key,
      cls.created >= start_date,
      cls.created <= end_date,
      cls.kind == 'link_visited'
    )
    links_dict = {}
    # generate links_dict from stats
    # keys are from the 'properties' property
    return links_dict

I would like to have an AggregateStatisticStore model that stores the daily totals of the StatisticStore records. It could be generated once a day. Something like:

class AggregateStatisticStore(ndb.Model):
  user = ndb.KeyProperty(kind=User)
  date = ndb.DateProperty()
  kinds_count = ndb.PickleProperty()
  top_links = ndb.PickleProperty()

So that the following would hold:

start = datetime.datetime(2013, 8, 22, 0, 0, 0)
end = datetime.datetime(2013, 8, 22, 23, 59, 59)

aug22stats = StatisticStore.query(
  StatisticStore.user == user.key,
  StatisticStore.kind == 'link_visited',
  StatisticStore.created >= start,
  StatisticStore.created <= end
).count()
aug22toplinks = StatisticStore.top_links(user, start, end)

# .get() fetches the first matching entity (or None) instead of leaving a Query object
aggregated_aug22stats = AggregateStatisticStore.query(
  AggregateStatisticStore.user == user.key,
  AggregateStatisticStore.date == start.date()
).get()

aug22stats == aggregated_aug22stats.kinds_count['link_visited']
aug22toplinks == aggregated_aug22stats.top_links

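I was thinking of simply running a cron job that uses the taskqueue API. The task would generate each day's AggregateStatisticStore. But I'm worried it might run into memory issues, since StatisticStore could hold a lot of records for each user.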
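One way to keep memory bounded in such a daily job is to fold the records into counters in a single pass, so that only the running totals are held rather than the full result set. The following is a minimal sketch in plain Python, not NDB code: records are modeled as (kind, properties) pairs, and the 'link' key inside properties is an assumption, since the question does not say how the pickled data is laid out.

```python
from collections import Counter


def aggregate_day(records):
    """Fold an iterable of (kind, properties) pairs into two plain dicts.

    Only the counters are kept in memory, so `records` can be a lazy
    iterator (for example, a query that is iterated rather than fetched).
    """
    kinds_count = Counter()
    top_links = Counter()
    for kind, properties in records:
        kinds_count[kind] += 1
        # Assumption: link visits store the URL under a 'link' key.
        if kind == 'link_visited' and 'link' in properties:
            top_links[properties['link']] += 1
    return dict(kinds_count), dict(top_links)
```

The two returned dicts correspond to the kinds_count and top_links properties of the proposed aggregate model.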

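Also, the top_links property complicates things a bit. I'm not sure that having a property for it in the aggregate model is the best approach. Any suggestions about that property would be great.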

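Ultimately, I only want to keep StatisticStore records going back about 30 days. Records older than 30 days should be aggregated (and then deleted), to save space and to speed up the queries used for visualization.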
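The 30-day retention rule can be sketched as a cutoff computation. This example assumes the cleanup works on whole calendar days, so that every aggregate covers a complete day; the function name expired_days and the keep_days parameter are hypothetical.

```python
import datetime


def expired_days(created_times, now, keep_days=30):
    """Return the sorted calendar days that lie entirely before the cutoff.

    A day qualifies only if its date is strictly earlier than the date of
    (now - keep_days), so a partially expired day is left alone.
    """
    cutoff = (now - datetime.timedelta(days=keep_days)).date()
    return sorted({t.date() for t in created_times if t.date() < cutoff})
```

Each returned day could then be aggregated first and have its raw records deleted afterwards.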

EDIT: What about having each StatisticStore write create or update the appropriate AggregateStatisticStore record? That way, all the cron job has to do is clean up. Thoughts?
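The write-time idea from the edit could look roughly like this. It is a sketch that uses an in-memory dict as a stand-in for AggregateStatisticStore, keyed by (user_id, date); the record_stat helper and the 'link' key are assumptions. In the datastore, the read-modify-write step would presumably need to run inside a transaction so that concurrent writes do not lose updates.

```python
import datetime
from collections import Counter, defaultdict

# In-memory stand-in for AggregateStatisticStore, keyed by (user_id, date).
aggregates = defaultdict(
    lambda: {'kinds_count': Counter(), 'top_links': Counter()}
)


def record_stat(user_id, created, kind, properties):
    """Update the day's aggregate at the moment a statistic is recorded."""
    agg = aggregates[(user_id, created.date())]
    agg['kinds_count'][kind] += 1
    # Assumption: link visits store the URL under a 'link' key.
    if kind == 'link_visited' and 'link' in properties:
        agg['top_links'][properties['link']] += 1
    return agg
```

With this arrangement the daily totals are always current, and the scheduled job is reduced to deleting old raw records.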

Answer 1

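Yes, mapreduce would help with this. Alternatively, you could run the cron job on a "backend" (now called modules) instance. That would alleviate both the memory concerns and the limit on how long the job can run.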

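Another approach might be to move the aggregation to write time. Since this is per user, you may find that you save a lot of work that way. If AggregateStatisticStore is per day, you might want to use something other than DateProperty for the date. DateProperty would certainly work, but I have found it easier to use an IntegerProperty for this sort of thing, since the int is just "time since some epoch".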
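To illustrate the integer-date suggestion, here is a small sketch that maps a calendar date to a day count and back. The choice of 1970-01-01 as the epoch and the helper names are assumptions; any fixed reference date would do.

```python
import datetime

# Assumed reference date; any fixed date would serve as the epoch.
EPOCH = datetime.date(1970, 1, 1)


def date_to_day_number(d):
    """Number of whole days between EPOCH and the given date."""
    return (d - EPOCH).days


def day_number_to_date(n):
    """Inverse of date_to_day_number."""
    return EPOCH + datetime.timedelta(days=n)
```

Stored this way, a range of days becomes a plain integer comparison, and consecutive days differ by exactly 1.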

Answer 2

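Regarding the aggregation of data:

Change StatisticStore and AggregateStatisticStore to have user.key as their parent. That means removing user = ndb.KeyProperty(kind=User) from each model, creating each entity with parent=user.key, and passing ancestor=user.key in the query() calls. NDB is good at aggregating data that shares a parent.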

Answer 3

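If the AggregateStatisticStore records are independent of one another, you don't need MapReduce. If you can run a loop over the users, just run one task queue task per user and write one record. That is really just the "map" phase.

If you can break it down further into more parallel tasks, you can create more task queue tasks. "Parallelized"!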
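The per-user fan-out can be sketched as a loop that enqueues one independent task per user. The enqueue callable is a placeholder so that the example runs anywhere; on App Engine it might wrap taskqueue.add(url=..., params=...). The /tasks/aggregate URL and the parameter names are hypothetical.

```python
import datetime


def fan_out(user_ids, day, enqueue):
    """Enqueue one aggregation task per user and return how many were queued.

    `enqueue` is any callable taking (url, params); each task is
    independent, so the tasks can run in parallel.
    """
    count = 0
    for user_id in user_ids:
        # Hypothetical handler URL and parameter names.
        enqueue('/tasks/aggregate',
                {'user_id': user_id, 'day': day.isoformat()})
        count += 1
    return count
```

Each task handler would then aggregate a single user's records for that day and write one AggregateStatisticStore record.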

3 answers

Answer 1
1 2013-08-22 14:02:46

Answer 2
0 2013-08-22 21:00:44

Answer 3
0 2013-08-22 21:11:00