
Create file-like object for data buffer

Contextualisation

I am writing a program that reads data from a sensor and then does something with it. Currently I want the data to be sent to a server. I have two processes that communicate through sockets: one that reads the data and stores it in a temporary file, and another that reads the temporary file and sends the data to the server.

Problem

The problem has never actually shown up in testing; however, I have realised that, if the sampling frequency is high, it is quite possible that both processes will try to read/write the file at the same time (not that they request it at exactly the same time, but that one tries to open it before the other has closed it).

Even if this does not raise an error (from what I have read online, some operating systems do not lock the file), it may cause serious version-inconsistency errors, leading to lost pieces of data. For this reason, this way of handling the data does not look very appropriate.

My own idea/approach

I thought of using a file-like object in memory (a data buffer). I have no experience with this concept in Python, so I have researched a bit, and I understand that [a buffer] is like a file that is kept in memory while the program is executing and that has properties very similar to those of a standard system file. I thought it might be a good idea to use it; however, I could not find a solution to some of these issues:

  1. Since it's still like a file (file-like object), could it not be the case that, if the two processes coincide in their operations on the object, version-inconsistency errors/bugs could arise? I only need to append data with one process (at the end) and remove data from the beginning with the other (as some sort of queue). Does this Python functionality permit this, and if so, which methods exactly should I look into in the docs?

  2. For the reason explained above, I thought about literally using queues; however, this might be inefficient in terms of execution time (appending to a list is rather fast, but appending to a pandas object is around 1000 times slower, according to a test I did on my own machine to see which object type would fit best). Is there an object, if not a file-like one, that lets me do this and is efficient? I know efficiency is subjective, so let's say 100 appends per second with no noticeable lag (timestamps are important in this case).

  3. Since I am using two different processes and these do not share memory in Python, is it still possible to point to the same memory address while operating on the file-like object? They communicate through sockets, as I said, but that method is afaik call-by-value, not by reference; so this looks like a serious problem to me (maybe it is necessary to merge them into two threads instead of separate Python processes?)

Please comment if you need any other detail; I will be very happy to answer.

Edits: questions asked in comments:

How are you creating these processes? Through a Python module like multiprocessing or subprocess, or some other way?

I am running them as two completely separate programs. Each has a different main Python file that is called by a shell script; however, I am open to changing this behaviour if needed.

On the other hand, the process that reads the data from the sensors has two threads: one that literally reads the data, and another that listens for socket requests.

What type of data are you getting from the sensor and sending to the server?

I am sending tables that contain floats, generally; however, sensors may also produce video streams or other sorts of data structures.

Misconception of Queue | pandas

I know a queue has nothing to do with a dataframe; I am just saying I tried to use a dataframe and it didn't perform well because it's meant to pre-allocate the memory space it needs (if I'm right). I am just expressing my concerns about the performance of the solution.

First, you really are looking at building exactly what io.BytesIO already does. It's a file-like object that's stored entirely in memory. Each process's objects are completely independent of every other process's objects. It just is everything you want. But it isn't going to do you any good here. The fact that it's a file-like object doesn't mean that it's accessible from other processes. In fact, that's the whole point of file-like objects that aren't files: they aren't files.

But you could just explicitly lock your files.

It's true that, other than Windows, most operating systems don't automatically lock files, and some don't even have “mandatory” locks, only cooperative locks that don't actually protect files unless all of the programs are written to use the locks. But that's not a problem.

One option is to write separate code for Windows and Unix: on Windows, rely on opening the files in exclusive mode; on Unix, use flock.

The other option is to create manual lockfiles. You can atomically try to create a file and fail if someone else did it first on every platform by just using os.open with the O_CREAT|O_EXCL flags, and you can build everything else you need on top of that.
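As a rough illustration of the lockfile idea, here is a minimal sketch. The function names, the polling interval and the timeout are my own assumptions, not part of any library; only os.open with O_CREAT|O_EXCL is the mechanism described above.

```python
import os
import time


def acquire_lock(lock_path, timeout=5.0, poll_interval=0.01):
    """Try to create lock_path atomically; return True once we own the lock.

    Hypothetical helper: the name, timeout and polling strategy are assumptions.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails with FileExistsError if the file already exists,
            # so only one process can succeed in creating it.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_interval)
        else:
            os.close(fd)
            return True


def release_lock(lock_path):
    """Give the lock up by deleting the lockfile."""
    os.remove(lock_path)
```

Each process would call acquire_lock before touching the data file and release_lock afterwards. Note that this sketch does nothing about stale lockfiles left behind by a crashed process; that is part of the "everything else" you would have to build on top.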


If you're thinking about using shared memory, unless you're using multiprocessing, it's pretty painful to do that in a cross-platform way.

But you can get the same effect by using a regular file and using mmap in each process to access the file as if it were normal memory. Just make sure to only use the cross-platform values for length and access (and not to use platform-specific parameters like prot or flags) and it works the same way everywhere.
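A minimal sketch of the mmap approach, assuming fixed-size records of two doubles (timestamp, value). The record layout and the helper names are assumptions made up for illustration; the mmap calls use only fileno, length and access.

```python
import mmap
import struct

# Assumed record layout: two little-endian doubles (timestamp, value).
RECORD = struct.Struct("<dd")


def create_backing_file(path, n_records):
    """Pre-size the file; an empty file cannot be memory-mapped."""
    with open(path, "wb") as f:
        f.write(b"\x00" * (RECORD.size * n_records))


def write_record(path, index, timestamp, value):
    """Write one record into slot `index` through a writable mapping."""
    with open(path, "r+b") as f:
        # length=0 maps the whole file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
            RECORD.pack_into(mm, index * RECORD.size, timestamp, value)
            mm.flush()


def read_record(path, index):
    """Read slot `index` back through a read-only mapping."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return RECORD.unpack_from(mm, index * RECORD.size)
```

In a real setup each process would probably keep its mapping open rather than remapping on every call, and nothing here synchronises the writer with the reader, so some form of locking or a head/tail index would still be needed.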

Of course you can't put Python objects into either shared memory or an mmap, but you can put raw bytes, or “native values”, or arrays of them, or ctypes Structures of them, or, best of all, multidimensional numpy arrays of them. For all but the last one, you can use the appropriate wrapper objects out of multiprocessing even if you aren't otherwise using the module. For the last one, just use np.memmap instead of using mmap directly, and it takes care of everything.


However, you may be right about a queue being faster. If that really is a concern (although I'd actually build and test it to see whether that's a problem before solving it…), then go for it. But you seem to have some misconceptions there.

First, I don't know why you think a queue has anything to do with appending to pandas DataFrames. I suppose you could use a df as a queue, but there's no intrinsic connection between the two.

Meanwhile, a list is fine for a small queue, but for a very large one, it's not. Either you append to the right and pop from the left, or you append to the left and pop from the right. Either way, that operation on the left takes time linear in the size of the queue, because you have to shift the whole rest of the list left or right by one slot. The solution to this is collections.deque, an object that's almost the same as a list, except that it can insert or delete in constant time on both sides, instead of just the right side.
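For illustration, a tiny sketch of a deque used as a first-in, first-out queue within a single process. The function name and the sample data are made up.

```python
from collections import deque


def drain_in_order(readings):
    """Push every reading on the right, then pop them all from the left (FIFO)."""
    queue = deque()
    for reading in readings:
        queue.append(reading)           # constant-time append on the right
    drained = []
    while queue:
        drained.append(queue.popleft())  # constant-time pop on the left
    return drained


print(drain_in_order([(0.00, 21.5), (0.01, 21.7), (0.02, 21.6)]))
```

With a plain list, the equivalent of popleft would be list.pop(0), which is the linear-time operation described above.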

But again, that doesn't solve anything because it's not actually shared in any way. You need some kind of interprocess queue, and neither a DataFrame nor a list (nor a deque) helps there.

You can build an interprocess queue on top of a pipe. Depending on how your processes are run, this could be an anonymous pipe, where the launcher program hands an end of the pipe to the child program, or this could be a named pipe, which is slightly different on Windows vs. Unix, but in both cases it works by both programs having some globally-known name (like a filesystem path) to use to open the same pipe.

You can also build an interprocess queue on top of a TCP socket. If you bind to localhost and connect to localhost, this is almost as efficient as a pipe, but it's simpler to write cross-platform.

So, how do you build a queue on top of a pipe or socket? The only problem is that you have just a stream of bytes instead of a stream of messages.

  • If your messages are all the same size, you just sendall on one side, and recv in a loop until you have MESSAGESIZE bytes.
  • If they're in some self-delimiting format like pickle, there's no problem; just sendall on one side, and recv until you have a complete pickle on the other side. You can even use socket.makefile (only for sockets, not pipes, of course) to get a file-like object you can pass straight to pickle.dump and pickle.load.
  • You can use some kind of delimiter (eg, if your messages are text that can never include a newline, or can never include a NUL byte, you can just use newline or 0 as a delimiter—and if you use newline, makefile takes care of this for you again).
  • Or you can send the size of each message before the message itself (eg, using a trivial protocol like netstring).
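The last option can be sketched roughly as follows. This uses a 4-byte binary length prefix rather than the netstring format itself, and the helper names are assumptions; it only relies on sendall and recv.

```python
import socket
import struct

# Assumed header: message length as a 4-byte unsigned int in network byte order.
HEADER = struct.Struct("!I")


def send_message(sock, payload):
    """Send one message: the length header followed by the payload bytes."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exactly(sock, n):
    """Call recv in a loop until exactly n bytes have arrived."""
    chunks = []
    remaining = n
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("socket closed in the middle of a message")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_message(sock):
    """Receive one message: read the header first, then that many payload bytes."""
    (size,) = HEADER.unpack(recv_exactly(sock, HEADER.size))
    return recv_exactly(sock, size)
```

The recv_exactly loop is also what the fixed-size case in the first bullet needs, with n set to MESSAGESIZE.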

If you are (or could be) using the multiprocessing library to control all of your separate processes, it comes with a Queue class built in, which builds an IPC queue on top of sending pickles over a pipe in an efficient way for every major platform, but you don't have to worry about how it works; you just hand the queue off to your child processes, and you can put on one end and get on the other and it just works.
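A minimal sketch of that setup, assuming one launcher script starts the sensor-reading side as a child process. The function names, the dictionary layout of each sample and the None end-of-stream marker are all assumptions for illustration.

```python
import multiprocessing


def producer(queue, n_samples):
    """Stand-in for the sensor reader: put a few fake samples on the queue."""
    for i in range(n_samples):
        queue.put({"seq": i, "value": i * 0.5})
    queue.put(None)  # assumed convention: None means "no more data"


def drain(queue):
    """Stand-in for the sender: get samples until the end marker arrives."""
    samples = []
    while True:
        sample = queue.get()
        if sample is None:
            return samples
        samples.append(sample)


if __name__ == "__main__":
    queue = multiprocessing.Queue()
    child = multiprocessing.Process(target=producer, args=(queue, 3))
    child.start()
    print(drain(queue))
    child.join()
```

Anything put on the queue has to be picklable, which is fine for tables of floats.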

You've misunderstood what a file-like object is. "File-like object" describes the interface an object presents - methods like read or write, and line-by-line iteration. It doesn't say anything about whether it stores data in memory. Regular file objects are file-like objects. File objects for OS-level pipes are file-like objects. io.StringIO and io.BytesIO objects are file-like objects, and those actually do work like you seem to have been thinking.
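To make the "interface" point concrete, here is a small sketch in which one function works unchanged on an in-memory buffer and on a real file. The function name and the sample bytes are made up.

```python
import io
import tempfile


def total_bytes(stream, chunk_size=4096):
    """Read any binary file-like object to the end and count its bytes."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return total
        total += len(chunk)


data = b"1.0,2.0\n3.0,4.0\n"

# An in-memory buffer: local to this process, never touches the disk.
memory_total = total_bytes(io.BytesIO(data))

# A real (temporary) file on disk, read through the same function.
with tempfile.TemporaryFile() as on_disk:
    on_disk.write(data)
    on_disk.seek(0)
    disk_total = total_bytes(on_disk)

print(memory_total, disk_total)
```

The function only cares that the object has a read method; where the bytes live is a separate question, and a BytesIO lives only inside the process that created it.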

Rather than thinking in terms of file-like objects, you should probably think about what OS-level mechanism you want to use to communicate between your processes. You've already got sockets; why not send data between your processes with a socket? A pipe would be another option. Shared memory is possible but platform-dependent and tricky; it's probably not the best option.
