
Storing large python object in RAM for later use

Is it possible to store Python (or C++) data in RAM for later use, and how can this be achieved?

Background: I have written a program that finds which lines in an input table match a given regular expression. I can find all the matching lines in roughly one second or less. However, the problem is that I process the input table into a Python object every time I start this program, and that step takes about 30 minutes.

This program will eventually run on a machine with over 128 GB of RAM. The Python object takes about 2 GB of RAM. The input table changes rarely, so the Python object (which I am currently recalculating on every run) also changes rarely. Is there a way to create this Python object once, keep it in RAM 24/7 (rebuilding it if the input table changes or the server restarts), and then use it whenever needed?

NOTE: The Python object will not be modified after creation. However, I need to be able to recreate it if necessary.

EDIT: The only solution I can think of is to keep the program running 24/7 (as a daemon?) and then issue commands to it as needed.

You could try pickling your object and saving it to a file, so that each time the program runs it just has to deserialise the object instead of recalculating it. Hopefully the server's disk cache will keep the file hot if necessary.
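A minimal sketch of that approach, assuming the table lives in a file called table.txt and a hypothetical build_table() function stands in for the expensive processing:

```python
import os
import pickle

CACHE = "table_cache.pkl"   # hypothetical cache file name
SOURCE = "table.txt"        # hypothetical input table

def build_table(path):
    # Placeholder for the expensive processing step; the real version
    # is whatever currently takes ~30 minutes.
    with open(path) as f:
        return [line.rstrip("\n") for line in f]

def load_table():
    # Rebuild the cache only when it is missing or older than the input table.
    if (not os.path.exists(CACHE)
            or os.path.getmtime(CACHE) < os.path.getmtime(SOURCE)):
        table = build_table(SOURCE)
        with open(CACHE, "wb") as f:
            pickle.dump(table, f, protocol=pickle.HIGHEST_PROTOCOL)
        return table
    with open(CACHE, "rb") as f:
        return pickle.load(f)
```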

To store anything in RAM you need a running process, so the easiest solution is to implement what you wrote in your edit. You could also create a separate process that always runs, and have the short-lived process connect to it to get the data. How you connect is up to you: you could use shared memory or a TCP/IP socket. TCP/IP has the advantage of making the data accessible over the network, but please make it secure.
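A rough sketch of the long-running-process idea, using the standard library's multiprocessing.connection for the socket plumbing (the port, authkey, and build_table() helper are assumptions for illustration, not part of the original setup):

```python
# table_server.py -- long-running process that keeps the object in RAM
import re
from multiprocessing.connection import Listener

def build_table(path):
    # Placeholder for the expensive ~30 minute build step.
    with open(path) as f:
        return [line.rstrip("\n") for line in f]

TABLE = build_table("table.txt")   # built once, stays in this process

with Listener(("localhost", 6000), authkey=b"change-me") as listener:
    while True:
        with listener.accept() as conn:
            pattern = conn.recv()                       # regex sent by a client
            rx = re.compile(pattern)
            conn.send([t for t in TABLE if rx.search(t)])
```

```python
# table_client.py -- short-lived process that queries the server
from multiprocessing.connection import Client

with Client(("localhost", 6000), authkey=b"change-me") as conn:
    conn.send(r"ERROR \d+")        # hypothetical search pattern
    for line in conn.recv():
        print(line)
```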

--edit--

Most operating systems also allow you to mount a portion of RAM as a drive (a RAM drive). You could write the objects to that (as Neil suggested).

We regularly load and store chunks of memory much larger than 2 GB in no time (seconds). We can get 350 MB/s from our 3-year-old SAN.

The bottlenecks/overheads mostly involve Python object management. I find that using marshal is much faster than cPickle. Combined with data structures that involve a minimal number of Python object handles, this is more than fast enough.

For data structures, you can use either array.array or numpy. array.array is slightly more portable (no extra libraries involved), but numpy is much more convenient in many ways. For example, instead of holding 10 million integers as 10 million Python objects, you would create a single array.array('i') with 10 million elements.
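A small illustration of that idea (not the answerer's actual code):

```python
from array import array

# One array object holding 10 million C ints, instead of 10 million
# individual Python int objects in a list.
values = array("i", range(10_000_000))

# The raw buffer can be dumped to and restored from disk very cheaply.
with open("values.bin", "wb") as f:
    values.tofile(f)

restored = array("i")
with open("values.bin", "rb") as f:
    restored.fromfile(f, 10_000_000)
```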

The best part of using marshal is that it is a very simple format that you can write and read easily from C/C++ code.
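A minimal sketch of the marshal round trip (the file name and the small metadata dict are made up for illustration; the array's raw buffer is stored as a bytes object, since marshal handles plain containers, ints, and bytes directly):

```python
import marshal
from array import array

values = array("i", range(10_000_000))

# Dump the typecode plus the raw buffer; marshal writes this very quickly.
with open("table.marshal", "wb") as f:
    marshal.dump({"typecode": values.typecode, "data": values.tobytes()}, f)

# Reload and rebuild the array without creating millions of Python ints.
with open("table.marshal", "rb") as f:
    blob = marshal.load(f)

restored = array(blob["typecode"])
restored.frombytes(blob["data"])
```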

Your problem description is kind of vague and can be read in several different ways.

One way in which I read this is that you have some kind of ASCII representation of a data structure on disk. You read this representation into memory, and then grep through it one or more times looking for things that match a given regular expression.

Speeding this up depends a LOT on the data structure in question.

If you are simply doing line splitting, then maybe you should just read the whole thing into a byte array with a single read call. Then you can alter how you grep to use a byte-array grep that doesn't span multiple lines. If you fiddle the expression so it always matches a whole line, by putting ^.*? at the beginning and .*?$ at the end (the ? forces a minimal rather than maximal munch), then you can check the size of the matched expression to find out how many bytes forward to go.
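A sketch of that idea, assuming the table is a plain text file and using a made-up search pattern:

```python
import re

with open("table.txt", "rb") as f:
    data = f.read()                     # one read: the whole file as bytes

# Wrap the user's pattern so every match covers exactly one whole line.
# With MULTILINE set (and DOTALL left off), '.' never crosses a newline,
# so ^.*? ... .*?$ expands only to the boundaries of the current line.
user_pattern = rb"ERROR \d+"            # hypothetical pattern
line_re = re.compile(rb"^.*?" + user_pattern + rb".*?$", re.MULTILINE)

pos = 0
while True:
    m = line_re.search(data, pos)
    if m is None:
        break
    print(m.group(0).decode("utf-8", "replace"))
    pos = m.end() + 1                   # the match span tells us how far to skip
```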

Alternatively, you could use the mmap module to achieve something similar without having to read anything explicitly and incur the copy overhead.
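Something along these lines (again with a hypothetical file name and pattern); re scans the mapped buffer in place, so the file contents are never copied into a separate bytes object:

```python
import mmap
import re

with open("table.txt", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # The OS pages the file in on demand; no explicit read() needed.
        for m in re.finditer(rb"^.*?ERROR \d+.*?$", mm, re.MULTILINE):
            print(m.group(0).decode("utf-8", "replace"))
```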

If there is a lot of processing involved in creating your data structure, and you can't think of a way to use the data in the file in a very raw way as a simple byte array, then you're left with various other solutions depending on the specifics; of these, it sounds like creating a daemon is the best option.

Since your basic operation seems to be 'tell me which table entries match a regexp', you could use the xmlrpc.server and xmlrpc.client libraries to wrap up a call that takes the regular expression as a string and returns the result in whatever form is natural. The library takes care of all the work of turning things that look like function calls into messages over a socket.
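Roughly like this (the port, the file name, and the trivial build step are placeholders for whatever your program actually does):

```python
# server side: holds the table in memory and answers regexp queries
import re
from xmlrpc.server import SimpleXMLRPCServer

# Placeholder for the expensive build step; done once, kept in this process.
with open("table.txt") as f:
    TABLE = [line.rstrip("\n") for line in f]

def grep(pattern):
    rx = re.compile(pattern)
    return [entry for entry in TABLE if rx.search(entry)]

server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
server.register_function(grep, "grep")
server.serve_forever()
```

```python
# client side: looks like an ordinary function call
import xmlrpc.client

proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
for line in proxy.grep(r"ERROR \d+"):     # hypothetical pattern
    print(line)
```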

Now, your idea of actually keeping it in memory is a bit of a red herring. I don't think it takes 30 minutes to read 2 GB of information from disk these days. It likely takes at most 5 minutes, and probably less than 1. So you might want to look at how you're building the data structure and see whether you could optimize that instead.

What pickle and/or marshal buy you is highly optimized code for building the data structure out of a serialized form. That may make the data structure creation constrained by disk read speed instead. It means the real problem you're addressing is not reading the data off disk each time, but building the data structure in your own address space.

And holding it in memory and using a daemon isn't a guarantee that it will stay in memory. It just guarantees that it stays built up as the data structure you want within the address space of a Python process. The OS may decide to swap that memory to disk at any time.

Again, this means that focusing on the time to read it from disk is likely not the right focus. Instead, focus on how to efficiently re-create (or preserve) the data structure in the address space of a Python process.

Anyway, that's my long-winded ramble on the topic. Given the vagueness of your question, there is no definite answer, so I just gave a smorgasbord of possible techniques and some guiding ideas.
