
python make huge file persist in memory

I have a Python script that needs to read a huge file into a variable and then search it and perform other operations. The problem is that the web server calls this script multiple times, and every time there is a latency of around 8 seconds while the file loads. Is it possible to make the file persist in memory so that later calls can access it faster? I know I can make the script a service using supervisor, but I can't do that in this case.

Any other suggestions, please? PS: I am already using var = pickle.load(open(file)).

You should take a look at http://docs.h5py.org/en/latest/ . It allows you to perform various operations on huge files. It's what NASA uses.
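
As a rough illustration of the idea (the file name "data.h5" and the dataset name "mydata" are assumptions, and the data would first have to be converted to HDF5), h5py opens the file lazily and reads only the slices you ask for:

    import h5py

    # Opening the file reads only metadata; the arrays stay on disk.
    with h5py.File("data.h5", "r") as f:
        dset = f["mydata"]        # handle to a dataset stored in the file
        chunk = dset[0:1000]      # only this slice is actually loaded into memory
        print(chunk.shape)

So instead of an 8-second load per request, each request pays only for the slices it actually reads.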

Not an easy problem. I assume you can do nothing about the fact that your web server calls your application multiple times. In that case I see two solutions:

(1) Write TWO separate applications. The first application, A, loads the large file and then it just sits there, waiting for the other application to access the data. "A" provides access as required, so it's basically a sort of custom server. The second application, B, is the one that gets called multiple times by the web server. On each call, it extracts the necessary data from A using some form of interprocess communication. This ought to be relatively fast. The Python standard library offers some tools for interprocess communication (socket, http server), but they are rather low-level. Alternatives are almost certainly going to be operating-system dependent.
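
A minimal sketch of option (1) using the standard library's multiprocessing.connection module (the file name, address, authkey, and the dict-style lookup protocol below are all assumptions, just to show the shape of the idea):

    # server_a.py -- application "A": pay the slow load once, then serve lookups
    from multiprocessing.connection import Listener
    import pickle

    with open("huge_file.pkl", "rb") as f:
        data = pickle.load(f)            # the ~8 second load happens only here

    with Listener(("localhost", 6000), authkey=b"change-me") as listener:
        while True:
            with listener.accept() as conn:
                key = conn.recv()        # request sent by application "B"
                conn.send(data.get(key)) # assumes the pickled object is dict-like

    # client_b.py -- application "B": called by the web server on every request
    from multiprocessing.connection import Client

    with Client(("localhost", 6000), authkey=b"change-me") as conn:
        conn.send("some_key")            # ask "A" for one piece of data
        result = conn.recv()             # fast: no file load on this side

Each web request then only pays for a small round trip to "A" instead of reloading the whole file.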

(2) Perhaps you can pre-digest or pre-analyze the large file, writing out a more compact file that can be loaded quickly. A similar idea is suggested by tdelaney in his comment (some sort of database arrangement).
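
Purely as an illustration of option (2) (the file names, the assumption that the pickled object is a dict, and the column layout are all hypothetical), a one-time pre-processing step could move the data into an indexed SQLite file so that each web request does a single fast lookup instead of loading everything:

    # preprocess.py -- run once: turn the huge pickle into an indexed SQLite file
    import pickle
    import sqlite3

    with open("huge_file.pkl", "rb") as f:
        data = pickle.load(f)            # assumed to be a dict-like mapping

    con = sqlite3.connect("digested.db")
    con.execute("CREATE TABLE IF NOT EXISTS items (key TEXT PRIMARY KEY, value TEXT)")
    con.executemany("INSERT OR REPLACE INTO items VALUES (?, ?)",
                    ((k, str(v)) for k, v in data.items()))
    con.commit()
    con.close()

    # per-request lookup: opening the database is cheap and reads only one row
    import sqlite3
    con = sqlite3.connect("digested.db")
    row = con.execute("SELECT value FROM items WHERE key = ?", ("some_key",)).fetchone()
    con.close()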

You are talking about memory-caching a large array, essentially…?

There are three fairly viable options for large arrays:

  1. use memory-mapped arrays
  2. use h5py or pytables as a back-end
  3. use an array caching-aware package like klepto or joblib.

Memory-mapped arrays index the array on disk, as if it were in memory. h5py or pytables give you fast access to arrays on disk, and can also avoid loading the entire array into memory. klepto and joblib can store arrays as a collection of "database" entries (typically a directory tree of files on disk), so you can load portions of the array into memory easily. Each has a different use case, so the best choice for you depends on what you want to do. (I'm the klepto author, and it can use SQL database tables as a backend instead of files.)
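
As a quick sketch of the first option (the file name, dtype, and shape are assumptions and must match how the array was originally written to disk), numpy's memmap pages in only the parts of the array you actually touch:

    import numpy as np

    # Map the file lazily; nothing is read until the array is indexed.
    arr = np.memmap("big_array.dat", dtype="float32", mode="r",
                    shape=(1_000_000, 100))

    row = arr[42]        # only this row's bytes are read from disk
    block = arr[:1000]   # likewise, only the first 1000 rows

This keeps the per-request startup cost close to zero, at the price of disk reads when the data is first accessed.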
