
How to save data from a data stream without blocking the stream? (PyQt5 signal emit() performance)

I'm developing a PyQt5 application. The application has a data stream, and its speed is about 5~20 items per second.

Every time data arrives, the following onData() method of the Analyzer class is called. (The code below is a simplified version of my app.)

class Analyzer():
    def __init__(self):
        self.cnt = 0
        self.dataDeque = deque(maxlen=10000)
    def onData(self, data):
        self.dataDeque.append({
            "data": data, 
            "createdTime": time.time()
        })
        self.cnt += 1
        if self.cnt % 10000 == 0:
            pickle.dump(self.dataDeque, open(file, 'wb'))

But the problem is that this dataDeque object is so large (50~150MB) that dumping the pickle takes about 1~2 seconds.

During that moment (1~2 seconds), calls to the onData() method get queued up, and after 1~2 seconds the queued requests invoke onData() many times at once, which eventually distorts the createdTime of the data.
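The effect can be reproduced without PyQt at all. The toy sketch below (the batch size and dump duration are made-up numbers, and time.sleep() stands in for the slow pickle.dump()) shows how a blocking dump inside the callback inflates the gap before the next item's createdTime:

```python
import time
from collections import deque

# Toy reproduction of the problem; BATCH and SLOW_DUMP_SECONDS are
# illustrative, and time.sleep() stands in for the blocking dump.
SLOW_DUMP_SECONDS = 0.2
BATCH = 5

buffer = deque(maxlen=10000)
gaps = []
last = time.time()

def on_data(data):
    global last
    now = time.time()
    gaps.append(now - last)   # time since the previous item arrived
    last = now
    buffer.append({"data": data, "createdTime": now})
    if len(buffer) % BATCH == 0:
        time.sleep(SLOW_DUMP_SECONDS)  # stand-in for pickle.dump()

for i in range(10):
    on_data(i)

# The item arriving right after a "dump" sees an inflated gap, even
# though the stream itself produced data continuously.
print(max(gaps) >= SLOW_DUMP_SECONDS)  # → True
```

The item processed immediately after each simulated dump carries a timestamp delayed by the full dump time, which is exactly the distortion described above.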

To solve this problem, I edited my code to use a thread (QThread) to save the pickle.

The following is the edited code.

from PickleDumpingThread import PickleDumpingThread
pickleDumpingThread = PickleDumpingThread()
pickleDumpingThread.start()

class Analyzer():
    def __init__(self):
        self.cnt = 0
        self.dataDeque = deque(maxlen=10000)
    def onData(self, data):
        self.dataDeque.append({
            "data": data, 
            "createdTime": time.time()
        })
        self.cnt += 1
        if self.cnt % 10000 == 0:
            pickleDumpingThread.pickleDumpingSignal.emit({
                "action": "savePickle",
                "deque": self.dataDeque
            })
            # pickle.dump(self.dataDeque, open(file, 'wb'))

The following is the PickleDumpingThread class.

class PickleDumpingThread(QThread):
    pickleDumpingSignal = pyqtSignal(dict)

    def __init__(self):
        super().__init__()
        self.daemon = True
        self.pickleDumpingSignal[dict].connect(self.savePickle)

    def savePickle(self, signal_dict):
        pickle.dump(signal_dict["deque"], open(file, 'wb'))

I expected this newly edited code to dramatically decrease the stream blocking time (1~2 seconds), but it still blocks the stream for about 0.5~2 seconds.

It seems that pickleDumpingThread.pickleDumpingSignal.emit(somedict) takes 0.5~2 seconds.

My question is 3 things.

  1. Is the signal emit() function's performance really this poor?

  2. Is there any possible alternative to the emit() function in my case?

  3. Or is there any way to save the pickle without blocking the data stream? (Any suggestion for modifying my code is highly appreciated.)

Thank you for reading this long question!

Something like this might work:

class PickleDumpingThread(QThread):
    threadFinished = pyqtSignal(object)

    def __init__(self, data):
        super().__init__()
        self.data = data

    def run(self):
        pickle.dump(self.data, open(file, 'wb'))
        self.threadFinished.emit(self)

class Analyzer():
    def __init__(self):
        self.cnt = 0
        self.dataDeque = deque(maxlen=10000)
        self.threadHandler = {}

    def onData(self, data):
        self.dataDeque.append({ "data": data, "createdTime": time.time() })
        self.cnt += 1
        if self.cnt % 10000 == 0:
            thread = PickleDumpingThread(self.dataDeque)
            thread.threadFinished.connect(self.threadFinished)
            thread.start()
            # keep a reference so the thread is not garbage collected early
            self.threadHandler[id(thread)] = thread

    @pyqtSlot(object)
    def threadFinished(self, thread):
        del self.threadHandler[id(thread)]

self.threadHandler is only there so you know how many threads are still running; you can get rid of it and the threadFinished method.

The problem was that I was not using QThread properly.

The result of printing

print("(Current Thread)", QThread.currentThread(),"\n")
print("(Current Thread)", int(QThread.currentThreadId()),"\n")

showed me that the PickleDumpingThread I created was running in the main thread, not in a separate thread.

The reason is that run() is the only function in a QThread that runs in a separate thread, so a method like savePickle defined on the QThread runs in the main thread.
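The same pitfall exists with plain threading.Thread, which makes it easy to demonstrate without Qt (the class and method names here are illustrative): only run() executes in the new thread, while any other method of the thread object runs in whichever thread calls it.

```python
import threading

# Plain-Python analogue of the QThread pitfall: only run() executes in
# the new thread; other methods run in the caller's thread.
class Worker(threading.Thread):
    def __init__(self):
        super().__init__()
        self.run_ident = None

    def run(self):
        # Executed in the worker thread.
        self.run_ident = threading.get_ident()

    def save_pickle(self):
        # Executed in the *caller's* thread, despite living on Worker.
        return threading.get_ident()

w = Worker()
w.start()
w.join()

main_ident = threading.get_ident()
print(w.run_ident != main_ident)      # → True: run() ran in the worker thread
print(w.save_pickle() == main_ident)  # → True: the method ran in the main thread
```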


First Solution

The proper usage with signals is to use a worker, as follows.

from PyQt5.QtCore import QThread, QObject, pyqtSignal
class GenericThread(QThread):
    def run(self, *args):
       #  print("Current Thread: (GenericThread)", QThread.currentThread(),"\n")
        self.exec_()

class PickleDumpingWorker(QObject):
    pickleDumpingSignal = pyqtSignal(dict)
    def __init__(self):
        super().__init__()
        self.pickleDumpingSignal[dict].connect(self.savePickle)

    def savePickle(self, signal_dict):
        pickle.dump(signal_dict["deque"], open(file, "wb"))

pickleDumpingThread = GenericThread()
pickleDumpingThread.start()

pickleDumpingWorker = PickleDumpingWorker()
pickleDumpingWorker.moveToThread(pickleDumpingThread)

class Analyzer():
    def __init__(self):
        self.cnt = 0
        self.dataDeque = deque(maxlen=10000)
    def onData(self, data):
        self.dataDeque.append({
            "data": data, 
            "createdTime": time.time()
        })
        self.cnt += 1
        if self.cnt % 10000 == 0:
            pickleDumpingWorker.pickleDumpingSignal.emit({
                "action": "savePickle",
                "deque": self.dataDeque
            })
            # pickle.dump(self.dataDeque, open(file, 'wb'))

This solution worked (the pickle was dumped in a separate thread), but its drawback is that the data stream still delays about 0.5~1 seconds because of the signal emit() function.

I found that the best solution for my case is @PYPL's code, but it needs a few modifications to work.


Final Solution

The final solution is modifying @PYPL's following code

thread = PickleDumpingThread(self.dataDeque)
thread.start() 

to

self.thread = PickleDumpingThread(self.dataDeque)
self.thread.start() 

The original code had a runtime error. It seems the thread was being garbage collected before it dumped the pickle, because there was no reference to it after the onData() function finished.

Keeping a reference to the thread via self.thread solved this issue.

Also, it seems that the old PickleDumpingThread is garbage collected after a new PickleDumpingThread is assigned to self.thread (because the old one loses its reference).

However, this claim is not verified (as I don't know how to view the currently active threads)..

Whatever the case, this solution solved the problem.


EDIT

My final solution has a delay too. It takes some amount of time to call Thread.start()..

The real final solution I chose is running an infinite loop in the thread and monitoring some variables of that thread to determine when to save the pickle. A bare infinite loop burns a lot of CPU, so I added time.sleep(0.1) to reduce the CPU usage.


FINAL EDIT 最终编辑

OK..My 'real final solution' also had delay.. Even though I moved dumping job to another QThread, the main thread still have delay about pickle dumping time! 好吧。我的'真正的最终解决方案'也有延迟..即使我将倾销工作转移到另一个QThread,主线程仍然有关于泡菜倾销时间的延迟! That was weird. 那很奇怪。

But I found the reason. 但我找到了原因。 The reason was neither emit() performance nor whatever I thought. 原因既不是发射()性能也不是我想的。

The reason was, embarrassingly, python's Global Interpreter Lock prevents two threads in the same process from running Python code at the same time . 原因是,令人尴尬的是, python的Global Interpreter Lock会阻止同一进程中的两个线程同时运行Python代码

So probably I should use multiprocessing module in this case. 所以我可能应该在这种情况下使用多处理模块。

I'll post the result after modifying my code to use multiprocessing module. 我会在修改代码后发布结果以使用多处理模块。

Edit after using the multiprocessing module, and future attempts

Using the multiprocessing module

Using the multiprocessing module solved the issue of running Python code concurrently, but a new essential problem arose: passing data between processes takes a considerable amount of time (in my case, passing the deque object to a child process took 1~2 seconds). I found that this cost cannot be removed as long as I use the multiprocessing module, so I gave up on it.
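For reference, a minimal sketch of this attempt (the file path and sizes are illustrative). Note that multiprocessing must pickle the deque once in the parent just to send it across the process boundary, which is exactly the cost described above:

```python
import multiprocessing as mp
import os
import pickle
import tempfile
from collections import deque

def dump_in_child(items, path):
    # Runs in the child process, off the parent's GIL.
    with open(path, "wb") as f:
        pickle.dump(items, f)

path = os.path.join(tempfile.gettempdir(), "analyzer_dump.pkl")
data = deque(({"data": i} for i in range(1000)), maxlen=10000)

ctx = mp.get_context("fork")  # assumes POSIX; "spawn" would need a __main__ guard
p = ctx.Process(target=dump_in_child, args=(data, path))
p.start()   # the deque crosses the process boundary here (pickled once)
p.join()

with open(path, "rb") as f:
    restored = pickle.load(f)
print(len(restored))  # → 1000
```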

Possible future attempts

1. Doing only file I/O in a QThread

The essential cost of pickle dumping is not writing to the file, but serializing before writing. Python releases the GIL while it writes to a file, so disk I/O can run concurrently in a QThread. The problem is that serializing the deque object inside pickle.dump takes some amount of time, and during that moment the main thread is blocked because of the GIL.
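In code, this amounts to splitting pickle.dump into its two halves (the file path here is illustrative): pickle.dump(obj, f) is effectively f.write(pickle.dumps(obj)), and only the write half releases the GIL.

```python
import os
import pickle
import tempfile
from collections import deque

data = deque({"data": i, "createdTime": float(i)} for i in range(100))

serialized = pickle.dumps(data)   # serialization: CPU-bound, GIL held
path = os.path.join(tempfile.gettempdir(), "split_dump.pkl")
with open(path, "wb") as f:
    f.write(serialized)           # file I/O: GIL released during the write

with open(path, "rb") as f:
    print(pickle.load(f) == data)  # → True
```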

Hence, the following approach should effectively reduce the delay:

  1. Stringify the data object every time onData() is called and push the string onto the deque.

  2. In PickleDumpingThread, just join the list(deque) of strings to stringify the whole deque.

  3. file.write(stringified_deque_object). This can be done concurrently.

Step 1 takes very little time, so it barely blocks the main thread. Step 2 might take some time, but obviously less than serializing the Python objects inside pickle.dump. Step 3 doesn't block the main thread.
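A sketch of these three steps, using json.dumps as the per-item "stringify" (this assumes the data is JSON-serializable; the names and file path are illustrative):

```python
import json
import os
import tempfile
from collections import deque

stringified = deque(maxlen=10000)

def on_data(data, created_time):
    # Step 1: serialize one small item on arrival -- cheap, barely blocks.
    stringified.append(json.dumps({"data": data, "createdTime": created_time}))

for i in range(5):
    on_data(i, float(i))

# Steps 2 and 3 belong in the worker thread: join the pre-serialized
# items into one string, then write it; file.write releases the GIL.
payload = "[" + ",".join(stringified) + "]"
path = os.path.join(tempfile.gettempdir(), "stream_dump.json")
with open(path, "w") as f:
    f.write(payload)

restored = json.loads(payload)
print(len(restored))  # → 5
```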

2. Using a C extension

We can manually release and reacquire the GIL in a custom C extension module. But this might get messy.

3. Porting from CPython to Jython or IronPython

Jython and IronPython are alternative Python implementations built on Java and C# respectively. They don't use a GIL, which means a thread really works like a thread. One problem is that PyQt is not supported on these implementations..

4. Porting to another language

..

Note:

  1. json.dump also took 1~2 seconds for my data.

  2. Cython is not an option for this case. Although Cython has with nogil:, only non-Python objects can be accessed in that block (the deque object cannot be accessed there), and we can't call pickle.dump in that block.

When the GIL is the problem, the workaround is to subdivide the task into chunks in such a way that you can refresh the GUI between chunks.

E.g. say you have one huge list of size S to dump. You could define a class that derives from list and overrides __getstate__ to return N subpickle objects, each one an instance of a class, say Subpickle, containing S/N items of your list. Each subpickle exists only while pickling, and defines __getstate__ to do 2 things:

  • call qApp.processEvents() on the GUI, and
  • return its sublist of S/N items.

While unpickling, each subpickle will refresh the GUI and take its list of items; at the end, the total list is recreated in the original object from all the subpickles it receives in its __setstate__.

You should abstract out the call to process events in case you want to unpickle the pickle in a console app (or non-PyQt GUI). You could do this by defining a class-wide attribute on Subpickle, say process_events, defaulting to None; if it is not None, __setstate__ calls it as a function. So by default there is no GUI refreshing between the subpickles, unless the app that unpickles sets this attribute to a callable before unpickling starts.

This strategy will give your GUI a chance to redraw during the unpickling process (and with only one thread, if you want).

Implementation depends on your exact data, but here is an example that demonstrates the principles for a large list:

import pickle

class SubList:
    on_pickling = None

    def __init__(self, sublist):
        print('SubList', sublist)
        self.data = sublist

    def __getstate__(self):
        if SubList.on_pickling is not None:
            print('SubList pickle state fetch: calling sub callback')
            SubList.on_pickling()
        return self.data

    def __setstate__(self, obj):
        if SubList.on_pickling is not None:
            print('SubList pickle state restore: calling sub callback')
            SubList.on_pickling()
        self.data = obj


class ListSubPickler:
    def __init__(self, data: list):
        self.data = data

    def __getstate__(self):
        print('creating SubLists for pickling long list')
        num_chunks = 10
        span = int(len(self.data) / num_chunks)
        SubLists = [SubList(self.data[i:(i + span)]) for i in range(0, len(self.data), span)]
        return SubLists

    def __setstate__(self, subpickles):
        self.data = []
        print('restoring Pickleable(list)')
        for subpickle in subpickles:
            self.data.extend(subpickle.data)
        print('final', self.data)

def refresh():
    # do something: refresh GUI (for example, qApp.processEvents() for Qt), show progress, etc
    print('refreshed')

data = list(range(100))  # your large data object
list_pickler = ListSubPickler(data)
SubList.on_pickling = refresh

print('\ndumping pickle of', list_pickler)
pickled = pickle.dumps(list_pickler)

print('\nloading from pickle')
new_list_pickler = pickle.loads(pickled)
assert new_list_pickler.data == data

print('\nloading from pickle, without on_pickling')
SubList.on_pickling = None
new_list_pickler = pickle.loads(pickled)
assert new_list_pickler.data == data

This is easy to apply to a dict, or even to make it adapt to the type of data it receives by using isinstance.
