How to design memory- and compute-intensive programs to run on Google App Engine

Question

I have a problem with my code running on Google App Engine. I don't know how to modify my code to suit GAE. The following is my problem:

for j in range(n):
    for d in range(j):
        for d1 in range(d):
            for d2 in range(d1):
                # block which runs in O(n^2)
                pass

Effectively, the entire code block is O(n^6), and it will run for more than 10 minutes depending on n, so I am using task queues. I will also need a 4-dimensional array, stored as a nested list (e.g. A[j][d][d1][d2]) of size n x n x n x n, i.e. it needs O(n^4) memory.

Since put() is limited to 10 MB, I can't store the entire array. So I tried chopping it into smaller chunks, storing them, and combining them on retrieval. I used JSON for this, but it doesn't work for larger n (> 40).

Then I stored the whole matrix as individual list entities in the datastore, i.e. one entity per A[j][d][d1], so there is no local variable. When I access A[j][d][d1][d2] in my code, I call my own functions getitem and putitem to get and put data from the datastore (with caching as well). As a result, my code takes more time for computation. After a few iterations, I get error 203 raised by GAE, and the task fails with code 500.

I know that my code may not be best suited for GAE. But what is the best way to implement it on GAE?

Answer 1

There may be even more efficient ways to store your data and to iterate over it.

Questions:

What datatype are you storing: list of list ... of int?
What range of the nested list does the O(n^2) portion in your innermost loop typically operate over?
When you call putitem or getitem, how many values are you retrieving in a single put or get?

Ideas:

You could try compressing your JSON (and base64-encoding it for cutting and pasting): 'myjson'.encode('zlib').encode('base64')
Use divide and conquer (map reduce), as @Robert suggested.
You may be able to use a dictionary with tuples for keys; this may need fewer lookups than A[j][d][d1][d2] in your inner loop. It would also allow you to sparsely populate your structure, although you would then need another way to track the bounds of the data you have loaded. A[j][d][d1][d2] becomes D[(j,d,d1,d2)] or D[j,d,d1,d2].
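To make the last two storage ideas concrete, here is a minimal sketch in Python 3 (the one-liner above uses Python 2 codec syntax). It assumes the stored values are plain ints and that only cells with d2 < d1 < d < j are ever used; pack and unpack are hypothetical helper names, not part of any GAE API.

```python
import base64
import json
import zlib


def pack(sparse):
    """Serialize a tuple-keyed dict to compressed, base64-encoded text."""
    # JSON object keys must be strings, so store each entry as
    # [j, d, d1, d2, value] instead of using the tuple as a key.
    rows = [list(key) + [value] for key, value in sorted(sparse.items())]
    raw = json.dumps(rows, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(zlib.compress(raw)).decode("ascii")


def unpack(text):
    """Inverse of pack(): rebuild the tuple-keyed dict."""
    raw = zlib.decompress(base64.b64decode(text))
    rows = json.loads(raw.decode("utf-8"))
    return {tuple(row[:-1]): row[-1] for row in rows}


# Sparse population: only the cells the loops actually visit are stored.
n = 6
D = {}
for j in range(n):
    for d in range(j):
        for d1 in range(d):
            for d2 in range(d1):
                D[j, d, d1, d2] = j + d + d1 + d2  # placeholder value

blob = pack(D)
restored = unpack(blob)
```

Because the loops only visit strictly decreasing index tuples, the dictionary holds far fewer than n^4 entries, which is the main saving here; how well the compression step helps depends on the actual values.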

Answer 2

You've omitted important details from your question, such as the expected size of n. Also, does the "# block which runs in O(n^2)" need access to the entire matrix, or are you simply populating the matrix based on the index values?

Here is a general answer: you need to find a way to break this up into smaller chunks. Maybe you can use some type of divide-and-conquer strategy and use tasks for parallelism. How you store your matrix depends on how you split the problem up. You might be able to store submatrices, or perhaps subvectors, using the index values as key names; again, this will depend on your problem and the strategy you use.
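As a rough illustration of that splitting, the sketch below divides the outermost index into slices and stores one sub-vector per (j, d, d1) under a key name built from the indices. Everything here is an assumption for illustration: the key format, the helper names, the dict standing in for the datastore, and above all the premise that slices do not depend on each other's results.

```python
def key_name(j, d, d1):
    """Hypothetical key-name scheme: one stored sub-vector per (j, d, d1)."""
    return "A:%d:%d:%d" % (j, d, d1)


def split_range(n, num_tasks):
    """Split range(n) into contiguous (start, stop) slices; assumes n >= 1."""
    size = -(-n // num_tasks)  # ceiling division
    return [(start, min(start + size, n)) for start in range(0, n, size)]


def run_slice(store, start, stop):
    """Work for one task: fill every sub-vector whose j lies in [start, stop)."""
    for j in range(start, stop):
        for d in range(j):
            for d1 in range(d):
                # Placeholder for the real O(n^2) block.
                store[key_name(j, d, d1)] = [j + d + d1 + d2 for d2 in range(d1)]


store = {}  # stand-in for the datastore
slices = split_range(10, 4)
for start, stop in slices:
    run_slice(store, start, stop)  # on GAE each call would be its own task
```

Equal-width slices are not equal work, since larger j values drive longer inner loops, so a real split would probably want narrower slices toward the end of the range.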

An alternative, if for some reason you cannot figure out how to parallelize your algorithm, is to use a continuation strategy of some type. In other words, figure out how many iterations you can typically do within the time constraints (leaving a safety margin); then, once you hit that limit, save your data and insert a new task to continue the processing. You'll just need to pass in the starting position and resume running from there. You may be able to do this easily by giving a starting parameter to the outermost range, but again it depends on the specifics of your problem.
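A minimal sketch of such a continuation loop, using a time budget rather than an iteration count. The function name, the budget value, and the work callback are hypothetical; saving state and enqueuing the follow-up task are left as comments because they depend on the task queue and storage calls you use.

```python
import time


def run_with_budget(start_j, n, budget_seconds, work, clock=time.monotonic):
    """Run work(j) for j in range(start_j, n) until the time budget is spent.

    Returns the next j to resume from, or None when everything is done.
    """
    deadline = clock() + budget_seconds
    for j in range(start_j, n):
        if clock() >= deadline:
            return j  # save state here, then enqueue a new task starting at j
        work(j)
    return None


done = []
next_j = 0
while next_j is not None:
    # Each pass stands in for one task; a real task would enqueue its successor
    # instead of looping.
    next_j = run_with_budget(next_j, 8, budget_seconds=0.5, work=done.append)
```

The budget has to be comfortably larger than one iteration of work(j); otherwise a task could return without making progress and the chain would never finish.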

Answer 3

Sam, just to give you an idea and a pointer on where to start.

If what you need is somewhere between storing the whole matrix and storing the numbers one by one, you may be interested in using pickle to serialize your list and store it in the datastore for later retrieval. A list is a Python object, so you should be able to serialize it.
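For instance, a sketch that pickles each top-level row A[j] separately, so that every blob stays well below the whole-matrix size. The helper names are hypothetical, and writing the resulting bytes to the datastore is left out; the linked recipe covers that part.

```python
import pickle


def dumps_rows(matrix):
    """Pickle each top-level row separately so every blob stays small."""
    return [pickle.dumps(row, protocol=pickle.HIGHEST_PROTOCOL) for row in matrix]


def loads_rows(blobs):
    """Rebuild the nested list from the per-row blobs."""
    return [pickle.loads(blob) for blob in blobs]


n = 5
A = [[[[j * d * d1 * d2 for d2 in range(n)] for d1 in range(n)]
      for d in range(n)] for j in range(n)]

blobs = dumps_rows(A)
largest = max(len(blob) for blob in blobs)  # compare against the storage limit
restored = loads_rows(blobs)
```

Only unpickle data your own application wrote, since pickle.loads can execute arbitrary code from an untrusted blob.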

http://appengine-cookbook.appspot.com/recipe/how-to-put-any-python-object-in-a-datastore

3 answers

Answer 1
1, accepted, 2010-12-28 16:01:29

Answer 2
0, 2010-12-28 07:43:42

Answer 3
0, 2011-02-22 10:29:56
