
How to handle large memory footprint in Python?

I have a scientific application that reads a potentially huge data file from disk and transforms it into various Python data structures such as a map of maps, a list of lists, etc. NumPy is called in for numerical analysis. The problem is that memory usage can grow rapidly. As swap space is used, the system slows down significantly. The general strategies I have seen:

  1. lazy initialization: this doesn't seem to help, in the sense that many operations require the data in memory anyway.
  2. shelving: this Python standard library module seems to support writing data objects into a datafile (backed by some db). My understanding is that it dumps data to a file, but if you need it, you still have to load all of it into memory, so it doesn't exactly help. Please correct me if this is a misunderstanding (see the shelve sketch after this list).
  3. The third option is to leverage a database, and offload as much of the data processing to it as possible.
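
For reference, here is a minimal sketch of the shelve usage I have in mind (the file name and keys are made up for illustration); as far as I can tell, values are unpickled one key at a time when accessed, rather than all at once, so it may help more than I assumed:

```python
# Minimal shelve sketch: objects live in a disk-backed, dict-like store,
# and each value is pickled/unpickled individually on assignment/access.
import shelve

# Write phase: each value is written through to disk as it is assigned.
with shelve.open("events.db") as db:
    for i in range(1000):
        db[f"chunk-{i}"] = list(range(i, i + 100))   # any picklable object

# Read phase: only the requested key is loaded back into memory.
with shelve.open("events.db") as db:
    chunk = db["chunk-42"]
    print(len(chunk))
```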

As an example: a scientific experiment runs for several days and has generated a huge (terabytes of data) sequence of:

co-ordinate (x, y) observed event E at time t.

And we need to compute a histogram over t for each (x, y) and output a 3-dimensional array.

Any other suggestions? I guess my ideal case would be that the in-memory data structure can be paged out to disk based on a soft memory limit, and this process should be as transparent as possible. Can any caching frameworks help with that?

Edit:

I appreciate all the suggested points and directions. Among those, I found user488551's comments to be most relevant. As much as I like Map/Reduce, for many scientific apps the setup and effort of parallelizing the code is an even bigger problem to tackle than my original question, IMHO. It is difficult to pick an answer, as my question itself is so open-ended ... but Bill's answer is closer to what we can do in the real world, hence the choice. Thank you all.

Have you considered divide and conquer? Maybe your problem lends itself to that. One framework you could use for that is Map/Reduce.

Does your problem have multiple phases, such that Phase I requires some data as input and generates an output which can be fed to Phase II? In that case you can have one process do Phase I and generate data for Phase II. Maybe this will reduce the amount of data you simultaneously need in memory?

Can you divide your problem into many small problems and recombine the solutions? In that case you can spawn multiple processes that each handle a small sub-problem, and have one or more processes combine these results at the end.
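
For instance, here is a minimal sketch of that idea using only the standard multiprocessing module rather than a full Map/Reduce stack; the chunk file names, the [0, 1] coordinate/time ranges and the bin counts are placeholders for illustration:

```python
# Map step: each worker histograms one chunk of the raw data.
# Reduce step: the parent sums the per-chunk histograms into one 3-D array.
import numpy as np
from multiprocessing import Pool

NX, NY, NT = 100, 100, 50                      # assumed grid and time-bin sizes
EDGES = (np.linspace(0, 1, NX + 1),            # shared bin edges so partial
         np.linspace(0, 1, NY + 1),            # histograms can be added up
         np.linspace(0, 1, NT + 1))

def partial_histogram(path):
    """Build a 3-D histogram for one chunk file (columns assumed to be x, y, t)."""
    data = np.loadtxt(path, ndmin=2)
    hist, _ = np.histogramdd(data, bins=EDGES)
    return hist

if __name__ == "__main__":
    chunk_files = [f"events_{i}.txt" for i in range(8)]   # hypothetical chunk files
    with Pool(processes=4) as pool:
        partials = pool.map(partial_histogram, chunk_files)
    total = np.sum(partials, axis=0)            # combine the sub-results
    print(total.shape)                          # (100, 100, 50)
```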

If Map-Reduce works for you, look at the Hadoop framework.

Well, if you need the whole dataset in RAM, there's not much to do but get more RAM. Sounds like you aren't sure if you really need to, but keeping all the data resident requires the smallest amount of thinking :)

If your data comes in as a stream over a long period of time, and all you are doing is creating a histogram, you don't need to keep it all resident. Just build your histogram as you go along, write the raw data out to a file if you want to have it available later, and let Python garbage-collect the data as soon as you have bumped your histogram counters. All you have to keep resident is the histogram itself, which should be relatively small.
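
A rough sketch of that streaming approach, assuming the raw data is a text file with one "x y t" triple per line, that x and y are already integer grid indices, and that the grid size and time range are known up front (the file name, sizes and ranges below are made up):

```python
# Accumulate the 3-D histogram chunk by chunk; only the counts array and the
# current chunk are ever resident, and each chunk is freed after counting.
import itertools
import numpy as np

NX, NY, NT = 100, 100, 50                        # assumed grid size and time bins
t_edges = np.linspace(0.0, 1.0, NT + 1)          # assumed time range [0, 1)
counts = np.zeros((NX, NY, NT), dtype=np.int64)

with open("events.txt") as f:                    # hypothetical raw data file
    while True:
        lines = list(itertools.islice(f, 1_000_000))   # next chunk of raw lines
        if not lines:
            break
        chunk = np.loadtxt(lines, ndmin=2)       # columns assumed: x, y, t
        x = chunk[:, 0].astype(int)
        y = chunk[:, 1].astype(int)
        t = np.clip(np.digitize(chunk[:, 2], t_edges) - 1, 0, NT - 1)
        np.add.at(counts, (x, y, t), 1)          # bump counters, then drop the chunk

print(counts.sum())                              # total number of events seen
```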
