How to retrieve a huge number (>2000) of entities from the GAE datastore in under 1 second?

Part of our application needs to load a large set of data (>2000 entities) and perform a computation on this set. The size of each entity is approximately 5 KB.

In our initial, naïve implementation, the bottleneck seems to be the time required to load all the entities (~40 seconds for 2000 entities), while the computation itself takes very little time (<1 second).

We have tried several strategies to speed up entity retrieval:

  • Splitting the retrieval request across several parallel instances and then merging the results: ~20 seconds for 2000 entities (a single-instance variant using the asynchronous datastore API is sketched after this list).
  • Storing the entities in an in-memory cache placed on a resident backend: ~5 seconds for 2000 entities.
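For reference, a parallel fan-in can also be expressed from a single instance with the low-level asynchronous datastore API, which issues the batch-get RPCs concurrently. This is a minimal sketch, not our actual fan-out implementation, and it assumes the entity keys are already known; `fetchAll` and `batchSize` are illustrative names:

```java
import com.google.appengine.api.datastore.AsyncDatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class ParallelFetch {
    /** Issues several batch gets concurrently, then merges the results. */
    public static Map<Key, Entity> fetchAll(List<Key> keys, int batchSize)
            throws ExecutionException, InterruptedException {
        AsyncDatastoreService ds = DatastoreServiceFactory.getAsyncDatastoreService();
        List<Future<Map<Key, Entity>>> pending = new ArrayList<Future<Map<Key, Entity>>>();
        for (int i = 0; i < keys.size(); i += batchSize) {
            List<Key> batch = keys.subList(i, Math.min(i + batchSize, keys.size()));
            pending.add(ds.get(batch)); // RPC starts immediately, does not block
        }
        Map<Key, Entity> merged = new HashMap<Key, Entity>();
        for (Future<Map<Key, Entity>> f : pending) {
            merged.putAll(f.get()); // block only while merging
        }
        return merged;
    }
}
```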

The computation needs to be performed dynamically, so precomputing at write time and storing the result does not work in our case.

We are hoping to be able to retrieve ~2000 entities in just under one second. Is this within the capability of GAE/J? Are there any other strategies that we could implement for this kind of retrieval?

UPDATE: Supplying additional information about our use case and parallelization results:

  • We have more than 200,000 entities of the same kind in the datastore, and the operation is retrieval-only.
  • We experimented with 10 parallel worker instances; a typical result can be seen in this pastebin. It seems that the serialization and deserialization required when transferring the entities back to the master instance hampers performance.

UPDATE 2: An example of what we are trying to do:

  1. Let's say that we have a StockDerivative entity that needs to be analyzed to determine whether it is a good investment.
  2. The analysis requires complex computations based on many factors, both external (e.g. the user's preferences, market conditions) and internal (i.e. the entity's properties), and outputs a single "investment score" value.
  3. The user can request that the derivatives be sorted by investment score and ask to be presented with the N highest-scored derivatives (a sketch of this scoring-and-ranking step follows this list).
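To make the shape of the problem concrete, here is a hedged sketch of that scoring-and-ranking step. The property names (`volatility`, `expectedYield`) and the scoring formula are invented for illustration; only the top-N-by-score pattern matters:

```java
import com.google.appengine.api.datastore.Entity;

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class DerivativeRanker {

    /** Hypothetical score combining entity properties with external factors. */
    static double investmentScore(Entity d, double riskTolerance, double marketIndex) {
        double volatility = (Double) d.getProperty("volatility");       // assumed property
        double expectedYield = (Double) d.getProperty("expectedYield"); // assumed property
        return expectedYield * marketIndex - volatility / riskTolerance;
    }

    /** Streams through the full set, keeping only the N highest-scored entities. */
    static List<Entity> topN(Iterable<Entity> derivatives, final int n,
                             final double riskTolerance, final double marketIndex) {
        // Min-heap ordered by score: the root is always the weakest of the current top N.
        PriorityQueue<Entity> heap = new PriorityQueue<Entity>(n, new Comparator<Entity>() {
            public int compare(Entity a, Entity b) {
                return Double.compare(investmentScore(a, riskTolerance, marketIndex),
                                      investmentScore(b, riskTolerance, marketIndex));
            }
        });
        for (Entity d : derivatives) {
            heap.offer(d);
            if (heap.size() > n) heap.poll(); // evict the current lowest score
        }
        List<Entity> result = new ArrayList<Entity>();
        while (!heap.isEmpty()) result.add(heap.poll()); // drain in ascending order
        Collections.reverse(result);                     // highest score first
        return result;
    }
}
```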

200,000 × 5 KB is roughly 1 GB. You could keep all of it in memory on the largest backend instance, or spread it across multiple instances. This would be the fastest solution; nothing beats memory.
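That suggestion boils down to a process-wide map on a resident backend that is warmed once and then served from RAM. A minimal sketch, assuming name-keyed entities and a warm-up triggered from the backend's start request; `WarmCache` and `warmUp` are illustrative names:

```java
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Query;

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Process-wide cache held by a resident backend; survives across requests. */
public class WarmCache {
    private static final ConcurrentMap<String, Entity> CACHE =
            new ConcurrentHashMap<String, Entity>();

    /** Call from the backend's /_ah/start handler to warm the cache once. */
    public static void warmUp(String kind) {
        Iterable<Entity> all = DatastoreServiceFactory.getDatastoreService()
                .prepare(new Query(kind))
                .asIterable();
        for (Entity e : all) {
            CACHE.put(e.getKey().getName(), e); // assumes name-keyed entities
        }
    }

    public static Entity get(String keyName) {
        return CACHE.get(keyName);
    }
}
```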

Do you need the whole 5 KB of each entity for the computation? Do you need all 200k entities when querying before the computation? Do queries touch all entities?

Also, check out BigQuery. It might suit your needs.

Use Memcache. I cannot guarantee that it will be sufficient, but if it isn't, you probably have to move to another platform.
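A sketch of that suggestion, with one important caveat: each memcache value is capped at roughly 1 MB, so ~2000 entities of ~5 KB each would have to be sharded across several keys. The key scheme and `loadShardFromDatastore` are hypothetical:

```java
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.memcache.Expiration;
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;

import java.util.List;

public class ShardedEntityCache {
    private static final MemcacheService CACHE =
            MemcacheServiceFactory.getMemcacheService();

    /** Returns one shard of entities, falling back to the datastore on a miss. */
    @SuppressWarnings("unchecked")
    public static List<Entity> getShard(int shard) {
        String key = "derivatives-shard-" + shard; // hypothetical key scheme
        List<Entity> batch = (List<Entity>) CACHE.get(key);
        if (batch == null) {
            batch = loadShardFromDatastore(shard);
            // Values above ~1 MB are rejected, hence the sharding.
            CACHE.put(key, batch, Expiration.byDeltaSeconds(300));
        }
        return batch;
    }

    private static List<Entity> loadShardFromDatastore(int shard) {
        throw new UnsupportedOperationException("stub: query one shard of entities");
    }
}
```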

In the end, it does not appear that we can retrieve >2000 entities from a single instance in under one second, so we are forced to use the in-memory cache placed on our backend instance, as described in the original question. If someone comes up with a better answer, or if we find a better strategy/implementation for this problem, I will change or update the accepted answer.

Our solution involves periodically reading the entities in a background task and storing the result in a JSON blob. That way we can quickly return more than 100k rows. All filtering and sorting is done in JavaScript using SlickGrid's DataView model.
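A rough outline of that approach, assuming a cron-triggered servlet; the JSON is built by hand here, the field names are illustrative, and `storeSnapshot` is a stub for wherever the blob actually lives:

```java
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Query;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/** Mapped to a cron URL, e.g. /tasks/rebuild-snapshot, run every few minutes. */
public class SnapshotServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) {
        Iterable<Entity> rows = DatastoreServiceFactory.getDatastoreService()
                .prepare(new Query("StockDerivative"))
                .asIterable(FetchOptions.Builder.withChunkSize(500));

        StringBuilder json = new StringBuilder("[");
        boolean first = true;
        for (Entity e : rows) {
            if (!first) json.append(',');
            first = false;
            json.append("{\"key\":\"").append(e.getKey().getName())
                .append("\",\"yield\":").append(e.getProperty("expectedYield"))
                .append('}'); // illustrative fields only
        }
        json.append(']');

        storeSnapshot(json.toString()); // stub: persist the blob for fast serving
    }

    private void storeSnapshot(String json) {
        throw new UnsupportedOperationException("stub: write JSON blob");
    }
}
```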

As someone has already commented, MapReduce is the way to go on GAE. Unfortunately, the Java library for MapReduce is broken for me, so we're using a non-optimal task to do all the reading, but we're planning to move to MapReduce (and/or the Pipeline API) in the near future.

Mind that, last time I checked, the Blobstore wasn't returning gzipped entities >1 MB, so at the moment we load the content from a compressed entity and expand it into memory; that way the final payload gets gzipped. I don't like that, as it introduces latency. I hope they fix the issues with GZIP soon!
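The workaround described above amounts to something like the following; the `payload` Blob property is an assumption about how the compressed content is stored:

```java
import com.google.appengine.api.datastore.Blob;
import com.google.appengine.api.datastore.Entity;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

public class CompressedContent {
    /** Expands a gzip-compressed Blob property back into raw bytes in memory. */
    public static byte[] expand(Entity e) throws IOException {
        Blob compressed = (Blob) e.getProperty("payload"); // assumed property name
        GZIPInputStream in =
                new GZIPInputStream(new ByteArrayInputStream(compressed.getBytes()));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        in.close();
        return out.toByteArray(); // the response layer can then re-gzip the payload
    }
}
```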

This is very interesting, but yes, it's possible, and I've seen some mind-boggling results.

I would have done the same: the map-reduce concept.

It would be great if you could provide more metrics: how many parallel instances do you use, and what are the results for each instance?

Also, does your process include retrieval alone, or retrieval and storing?

How many elements do you have in your datastore? 4000? 10000? The reason I ask is that you could cache them from a previous request.

Regards
