
Create file-like object for data buffer

Contextualisation

I am writing a program that reads data from a sensor and then does something with it. Currently I want the data to be sent to a server. I have two processes that communicate through sockets: one reads the data and stores it in a temporary file, and the other reads the temporary file and sends the data to the server.

Problem

The problem has never actually shown up in testing, but I have realised that if the sampling frequency is high, it is very likely that the two processes will collide when reading/writing the file (not that they request it at exactly the same time, but that one tries to open it before the other has closed it).

Even if this does not raise an error (from what I have read online, some operating systems do not put locks on the file), it may cause serious inconsistencies between what the two processes see, leading to lost pieces of data. For this reason, this way of handling the data does not look very appropriate.

My own idea/approach

I thought of using a file-like object in memory (a data buffer). I have no experience with this concept in Python, so I researched a bit, and I understand that [a buffer] is like a file that is kept in memory while the program is executing and that has properties very similar to those of a standard file on disk. I thought it might be a good idea to use one, but I could not find a solution to some of these inconveniences:

  1. Since it is still like a file (a file-like object), could it not happen that, if the two processes coincide in their operations on the object, consistency errors/bugs arise? I only need to append data with one process (at the end) and remove data from the beginning with the other (as some sort of queue). Does this Python functionality permit this, and if so, which methods exactly should I look for in the docs?

  2. Given the above, I thought about literally using queues; however, this might be inefficient execution-time-wise (appending to a list is rather fast, but appending to a pandas object is around 1000 times slower, according to a test I did on my own machine to see which object type would fit best). Is there an object, if not a file-like one, that lets me do this efficiently? I know efficiency is subjective, so let's say 100 appends per second with no noticeable lag (timestamps are important in this case).

  3. Since I am using two different processes, and these do not share memory in Python, is it still possible to point to the same memory address while operating on the file-like object? I communicate between them with sockets, as I said, but that method is AFAIK call-by-value, not by reference; so this looks like a serious problem to me (maybe I need to merge them into two threads of a single process instead of separate Python processes?).

Feel free to comment asking for any other detail if needed; I will be very happy to answer.

Edits: questions asked in comments:

How are you creating these processes? Through a Python module like multiprocessing or subprocess, or some other way?

I am running them as two completely separate programs. Each has a different main Python file that is called by a shell script; however, I am flexible about changing this behaviour if needed.

On the other hand, the process that reads the data from the sensors has two threads: one that literally reads the data, and another that listens for socket requests.

What type of data are you getting from the sensor and sending to the server?

Generally I am sending tables that contain floats; however, sensors may also produce a video stream or other sorts of data structures.

Misconception of Queue | pandas

I know a queue has nothing to do with a dataframe; I am just saying that I tried to use a dataframe and it didn't perform well, because (if I'm right) it pre-allocates the memory space it needs. I am just expressing my concerns about the performance of the solution.

First, you really are looking at building exactly what io.BytesIO already does. It's a file-like object that's stored entirely in memory. Each process's objects are completely independent of every other process's objects. It just is everything you want. But it isn't going to do you any good here. The fact that it's a file-like object doesn't mean that it's accessible from other processes. In fact, that's the whole point of file-like objects that aren't files: they aren't files.
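For illustration, here's a minimal sketch of what io.BytesIO gives you, and why it stays private to one process:

```python
import io

# An in-memory, file-like buffer: it supports the usual file methods,
# but it lives entirely inside this process's address space.
buf = io.BytesIO()
buf.write(b"sample reading: 3.14\n")

buf.seek(0)            # rewind, exactly like a real file
print(buf.read())      # b'sample reading: 3.14\n'

# A BytesIO created in another process is a completely separate object;
# nothing written here is visible there.
```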

But you could just explicitly lock your files.

It's true that, other than Windows, most operating systems don't automatically lock files, and some don't even have "mandatory" locks, only cooperative locks that don't actually protect files unless all of the programs are written to use the locks. But that's not a problem.

One option is to write separate code for Windows and Unix: on Windows, rely on opening the files in exclusive mode; on Unix, use flock.
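A rough, Unix-only sketch of the cooperative-lock approach (the filename is just a placeholder; a Windows branch would use exclusive open flags or msvcrt.locking instead):

```python
import fcntl   # Unix-only; not available on Windows

# Cooperative lock around the shared temporary file. This only protects the
# data if *both* the reader and the writer wrap their access in the same lock.
with open("sensor_data.tmp", "a+b") as f:          # placeholder filename
    fcntl.flock(f, fcntl.LOCK_EX)                   # blocks until the lock is free
    try:
        f.write(b"...new samples...")
        f.flush()
    finally:
        fcntl.flock(f, fcntl.LOCK_UN)               # released on close anyway, but be explicit
```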

The other option is to create manual lockfiles. You can atomically try to create a file (and fail if someone else created it first) on every platform just by using os.open with the O_CREAT|O_EXCL flags, and you can build everything else you need on top of that.
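A minimal sketch of that lockfile idea (the lock filename and retry interval are arbitrary choices, not a standard recipe):

```python
import os
import time

LOCKFILE = "sensor_data.lock"      # both programs must agree on this name

def acquire_lock() -> None:
    # O_CREAT | O_EXCL makes "create this file only if it doesn't exist yet"
    # a single atomic step on every major platform, so exactly one process wins.
    while True:
        try:
            fd = os.open(LOCKFILE, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return
        except FileExistsError:
            time.sleep(0.01)       # someone else holds the lock; retry shortly

def release_lock() -> None:
    os.remove(LOCKFILE)
```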


If you're thinking about using shared memory, unless you're using multiprocessing, it's pretty painful to do that in a cross-platform way.

But you can get the same effect by using a regular file and using mmap in each process to access the file as if it were normal memory. Just make sure to use only the cross-platform values for length and access (and not platform-specific parameters like prot or flags), and it works the same way everywhere.
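For example, a sketch of mapping a regular file with only the cross-platform arguments (the filename and size are placeholders):

```python
import mmap

SIZE = 4096

# Create (or truncate) the backing file once so it is large enough to map.
with open("shared.dat", "wb") as f:
    f.truncate(SIZE)

# Each process opens and maps the same file; writes through the map are
# visible to every other process mapping it.
with open("shared.dat", "r+b") as f:
    mm = mmap.mmap(f.fileno(), SIZE, access=mmap.ACCESS_WRITE)
    mm[0:5] = b"hello"
    mm.flush()
    mm.close()
```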

Of course you can't put Python objects into either shared memory or an mmap, but you can put raw bytes, or "native values", or arrays of them, or ctypes Structures of them, or, best of all, multidimensional numpy arrays of them. For all but the last one, you can use the appropriate wrapper objects out of multiprocessing even if you aren't otherwise using the module. For the last one, just use np.memmap instead of using mmap directly, and it takes care of everything.
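With numpy, that might look something like this (the filename and shape are made up for the sketch):

```python
import numpy as np

# Writer process: create a file-backed array of 1000 float64 samples.
samples = np.memmap("samples.dat", dtype=np.float64, mode="w+", shape=(1000,))
samples[0] = 3.14
samples.flush()                    # push the change out to the backing file

# Reader process: map the same file read-only and see the same values.
view = np.memmap("samples.dat", dtype=np.float64, mode="r", shape=(1000,))
print(view[0])                     # 3.14
```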


However, you may be right about a queue being faster. If that really is a concern (although I'd actually build it and test whether it's a problem before solving it…), then go for it. But you seem to have some misconceptions there.

First, I don't know why you think a queue has anything to do with appending to pandas DataFrames. I suppose you could use a df as a queue, but there's no intrinsic connection between the two.

Meanwhile, a list is fine for a small queue, but for a very large one, it's not. Either you append on the right and pop from the left, or you append on the left and pop from the right. Either way, the operation on the left takes time linear in the size of the queue, because you have to shift the whole rest of the list left or right by one slot. The solution to this is collections.deque, an object that's almost the same as a list, except that it can insert or delete in constant time on both sides, instead of just the right side.
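For instance:

```python
from collections import deque

q = deque()
q.append("sample 1")       # O(1) append on the right
q.append("sample 2")
first = q.popleft()        # O(1) pop from the left; list.pop(0) would be O(n)
```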

But again, that doesn't solve anything, because it's not actually shared in any way. You need some kind of interprocess queue, and neither a DataFrame nor a list (nor a deque) helps there.

You can build an interprocess queue on top of a pipe. Depending on how your processes are run, this could be an anonymous pipe, where the launcher program hands one end of the pipe to the child program, or it could be a named pipe, which works slightly differently on Windows vs. Unix, but in both cases both programs use some globally known name (like a filesystem path) to open the same pipe.
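On Unix, the named-pipe variant can be as simple as this sketch (the path is a placeholder; Windows named pipes need a different API):

```python
import os

FIFO = "/tmp/sensor_fifo"          # placeholder path both programs agree on

if not os.path.exists(FIFO):
    os.mkfifo(FIFO)                # Unix-only

# In the sensor-reading program:
with open(FIFO, "wb") as pipe:     # blocks until a reader opens the other end
    pipe.write(b"3.14,2.71\n")

# In the uploading program:
with open(FIFO, "rb") as pipe:     # blocks until a writer opens the other end
    line = pipe.readline()
```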

You can also build an interprocess queue on top of a TCP socket. If you bind to localhost and connect to localhost, this is almost as efficient as a pipe, but it's simpler to write in a cross-platform way.
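The localhost setup is just the usual socket boilerplate (the port number is arbitrary):

```python
import socket

# Uploader process: listen on a local-only address.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 50007))     # traffic never leaves the machine
srv.listen(1)
conn, _ = srv.accept()             # blocks until the sensor process connects

# Sensor process: connect to the same local address.
cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("127.0.0.1", 50007))
```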

So, how do you build a queue on top of a pipe or socket? The only problem is that you have a stream of bytes instead of a stream of messages.

  • If your messages are all the same size, you just sendall on one side, and recv in a loop until you have MESSAGESIZE bytes.
  • If they're in some self-delimiting format like pickle, there's no problem; just sendall on one side, and recv until you have a complete pickle on the other side. You can even use socket.makefile (only for sockets, not pipes, of course) to get a file-like object you can pass straight to pickle.dump and pickle.load.
  • You can use some kind of delimiter (e.g., if your messages are text that can never include a newline, or can never include a NUL byte, you can just use a newline or 0 as the delimiter; and if you use newline, makefile again takes care of this for you).
  • Or you can send the size of each message before the message itself (e.g., using a trivial protocol like netstring); see the sketch after this list.
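Here's roughly what that length-prefix framing could look like; send_msg/recv_msg are just illustrative helper names:

```python
import socket
import struct

def send_msg(sock: socket.socket, payload: bytes) -> None:
    # Prefix each message with its length as a 4-byte big-endian integer.
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exact(sock: socket.socket, n: int) -> bytes:
    # recv() may return fewer bytes than asked for, so loop until we have n.
    chunks = []
    while n:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)

def recv_msg(sock: socket.socket) -> bytes:
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)
```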

If you are (or could be) using the multiprocessing library to control all of your separate processes, it comes with a built-in Queue class, which builds an IPC queue on top of sending pickles over a pipe, in an efficient way for every major platform. But you don't have to worry about how it works; you just hand the queue off to your child processes, and you can put on one end and get on the other, and it just works.
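A bare-bones sketch of that, assuming you let multiprocessing launch both sides:

```python
from multiprocessing import Process, Queue

def sensor(q: Queue) -> None:
    q.put({"t": 0.001, "value": 3.14})     # any picklable object works

def uploader(q: Queue) -> None:
    sample = q.get()                        # blocks until something arrives
    print(sample)

if __name__ == "__main__":
    q = Queue()
    Process(target=sensor, args=(q,)).start()
    Process(target=uploader, args=(q,)).start()
```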

You've misunderstood what a file-like object is. "File-like object" describes the interface an object presents: methods like read or write, and line-by-line iteration. It doesn't say anything about whether the object stores its data in memory. Regular file objects are file-like objects. File objects for OS-level pipes are file-like objects. io.StringIO and io.BytesIO objects are file-like objects, and those actually do work the way you seem to have been thinking.

Rather than thinking in terms of file-like objects, you should probably think about what OS-level mechanism you want to use to communicate between your processes. You've already got sockets; why not send data between your processes with a socket? A pipe would be another option. Shared memory is possible, but platform-dependent and tricky; it's probably not the best option.
