
Python: make a huge file persist in memory

I have a Python script that needs to read a huge file into a variable, then search it and perform other operations. The problem is that the web server calls this script multiple times, and every time I incur a latency of around 8 seconds while the file loads. Is it possible to make the file persist in memory so I have faster access to it on later calls? I know I could run the script as a service using supervisor, but I can't do that in this case.

Any other suggestions, please? PS: I am already loading the file with var = pickle.load(open(file)).

You should take a look at h5py (http://docs.h5py.org/en/latest/). It lets you perform various operations on huge files without loading them entirely into memory. It's what NASA uses.
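As a rough sketch of the idea (the file name, dataset name, and shape here are all made up), a one-off conversion to HDF5 lets each later call read only the slice it needs:

```python
import numpy as np
import h5py

# One-off: convert the data into an HDF5 file (names and shape are illustrative)
with h5py.File("data.h5", "w") as f:
    f.create_dataset("values", data=np.random.rand(1_000_000))

# On each web-server call: opening is cheap, and indexing reads
# only the requested slice from disk instead of the whole file
with h5py.File("data.h5", "r") as f:
    chunk = f["values"][:1000]
```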

Not an easy problem. I assume you can do nothing about the fact that your web server calls your application multiple times. In that case I see two solutions:

(1) Write TWO separate applications. The first, A, loads the large file and then just sits there, waiting for the other application to access the data. A provides access on demand, so it is basically a custom server. The second application, B, is the one the web server calls multiple times. On each call, it extracts the necessary data from A using some form of interprocess communication; this ought to be relatively fast. The Python standard library offers some tools for interprocess communication (socket, http.server), but they are rather low-level. Alternatives are almost certainly going to be operating-system dependent.
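A minimal sketch of this arrangement using multiprocessing.connection from the standard library (the file name, port, authkey, and dict-style lookup are assumptions about your data, not part of the question):

```python
# application A: load the big pickle once, then answer lookups forever
import pickle
from multiprocessing.connection import Listener

with open("data.pkl", "rb") as f:        # hypothetical data file
    data = pickle.load(f)                 # the slow ~8 s load happens only once

with Listener(("localhost", 6000), authkey=b"change-me") as listener:
    while True:
        with listener.accept() as conn:
            key = conn.recv()             # B sends a query
            conn.send(data.get(key))      # assumes data is dict-like
```

```python
# application B: called by the web server; fetches from A in milliseconds
from multiprocessing.connection import Client

with Client(("localhost", 6000), authkey=b"change-me") as conn:
    conn.send("some_key")                 # hypothetical query
    result = conn.recv()
```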

(2) Perhaps you can pre-digest or pre-analyze the large file, writing out a more compact file that can be loaded quickly. A similar idea is suggested by tdelaney in his comment (some sort of database arrangement).
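For instance, if the file happens to hold key/value pairs, a one-off conversion into an indexed SQLite database (file and table names here are assumptions) would replace the 8-second bulk load with a per-call indexed lookup:

```python
# One-off pre-digestion: pickle -> indexed SQLite file
import pickle
import sqlite3

with open("data.pkl", "rb") as f:         # hypothetical source file
    data = pickle.load(f)                  # assumes a dict of str -> str

con = sqlite3.connect("data.db")
con.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")
con.executemany("INSERT OR REPLACE INTO kv VALUES (?, ?)", data.items())
con.commit()
con.close()

# On each web-server call: an indexed lookup, no bulk load
con = sqlite3.connect("data.db")
row = con.execute("SELECT value FROM kv WHERE key = ?", ("some_key",)).fetchone()
con.close()
```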

You are talking about memory-caching a large array, essentially…?

There are three fairly viable options for large arrays:

  1. use memory-mapped arrays
  2. use h5py or pytables as a back-end
  3. use an array caching-aware package like klepto or joblib.

Memory-mapped arrays index the array on disk as if it were in memory. h5py or pytables give you fast access to arrays on disk, and also let you avoid loading the entire array into memory. klepto and joblib can store arrays as a collection of "database" entries (typically a directory tree of files on disk), so you can load portions of the array into memory easily. Each has a different use case, so the best choice for you depends on what you want to do. (I'm the klepto author, and it can use SQL database tables as a backend instead of files.)
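A sketch of the memory-mapped option using numpy (the shape and file name are illustrative assumptions):

```python
import numpy as np

# One-off: store the array in numpy's binary format
big = np.random.rand(100_000, 100)
np.save("big.npy", big)

# On each call: mmap_mode maps the file lazily instead of reading it,
# so this returns almost instantly regardless of file size
arr = np.load("big.npy", mmap_mode="r")
row = arr[42]          # only the pages backing this row are read from disk
```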
