
How to save data from the data stream while not blocking the stream? (PyQt5 signal emit() performance)

I'm developing a PyQt5 application. The application receives a data stream, at a rate of about 5~20 items per second.

Every time a datum arrives, the following onData() method of the Analyzer class is called (the code below is a simplified version of my app):

class Analyzer():
    def __init__(self):
        self.cnt = 0
        self.dataDeque = deque(maxlen=10000)
    def onData(self, data):
        self.dataDeque.append({
            "data": data, 
            "createdTime": time.time()
        })
        self.cnt += 1
        if self.cnt % 10000 == 0:
            pickle.dump(self.dataDeque, open(file, 'wb'))

The problem is that this dataDeque object is so large (50~150 MB) that dumping the pickle takes about 1~2 seconds.

During that moment (1~2 seconds), calls to the onData() method get queued up; when the dump finishes, the queued calls all run at once, which distorts the createdTime of the data.

To solve this problem, I edited my code to save the pickle in a separate thread (QThread).

The following code is the edited code.

from PickleDumpingThread import PickleDumpingThread
pickleDumpingThread = PickleDumpingThread()
pickleDumpingThread.start()

class Analyzer():
    def __init__(self):
        self.cnt = 0
        self.dataDeque = deque(maxlen=10000)
    def onData(self, data):
        self.dataDeque.append({
            "data": data, 
            "createdTime": time.time()
        })
        self.cnt += 1
        if self.cnt % 10000 == 0:
            pickleDumpingThread.pickleDumpingSignal.emit({
                "action": "savePickle",
                "deque": self.dataDeque
            })
            # pickle.dump(dataDeque, open(file, 'wb'))

The following code is PickleDumpingThread class.

class PickleDumpingThread(QThread):
    pickleDumpingSignal = pyqtSignal(dict)

    def __init__(self):
        super().__init__()
        self.daemon = True
        self.pickleDumpingSignal[dict].connect(self.savePickle)

    def savePickle(self, signal_dict):
        pickle.dump(signal_dict["deque"], open(file, 'wb'))

I expected this edited code to dramatically decrease the stream-blocking time (from 1~2 seconds), but it still blocks the stream for about 0.5~2 seconds.

It seems like pickleDumpingThread.pickleDumpingSignal.emit(somedict) takes 0.5~2 seconds.

I have three questions:

  1. Is the signal emit() function's performance really this bad?

  2. Are there any possible alternatives to the emit() function in my case?

  3. Or is there any way to save the pickle without blocking the data stream? (Any suggestion for modifying my code is highly appreciated.)

Thank you for reading this long question!

Something like this might work:

class PickleDumpingThread(QThread):
    threadFinished = pyqtSignal(object)

    def __init__(self, data):
        super().__init__()
        self.data = data

    def run(self):
        pickle.dump(self.data, open(file, 'wb'))
        self.threadFinished.emit(self)

class Analyzer():
    def __init__(self):
        self.cnt = 0
        self.dataDeque = deque(maxlen=10000)
        self.threadHandler = {}

    def onData(self, data):
        self.dataDeque.append({ "data": data, "createdTime": time.time() })
        self.cnt += 1
        if self.cnt % 10000 == 0:
            thread = PickleDumpingThread(self.dataDeque)
            thread.threadFinished.connect(self.threadFinished)
            thread.start()
            self.threadHandler[id(thread)] = thread

    def threadFinished(self, thread):
        del self.threadHandler[id(thread)]

self.threadHandler is just there to track how many threads are still running; you can get rid of it and the threadFinished method.

The problem was that I was not using QThread properly.

The result of printing

print("(Current Thread)", QThread.currentThread(),"\n")
print("(Current Thread)", int(QThread.currentThreadId()),"\n")

showed me that the PickleDumpingThread I created was running in the main thread, not in a separate thread.

The reason is that run() is the only method of QThread that executes in the new thread, so a method like savePickle defined on a QThread subclass runs in the main thread.
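This pitfall is not specific to QThread; it can be demonstrated with plain threading.Thread (shown here to keep the sketch free of PyQt dependencies; the class and method names are just illustrative):

```python
import threading

class Dumper(threading.Thread):
    def __init__(self):
        super().__init__()
        self.run_tid = None

    def run(self):
        # Only the body of run() executes in the new thread.
        self.run_tid = threading.get_ident()

    def save_pickle_tid(self):
        # Any other method runs in whichever thread happens to call it.
        return threading.get_ident()

d = Dumper()
d.start()
d.join()
main_tid = threading.get_ident()
print(d.run_tid != main_tid)            # True: run() ran in the worker thread
print(d.save_pickle_tid() == main_tid)  # True: a plain method call stays in the caller's thread
```

The same holds for QThread: slots defined on the QThread subclass itself execute in whichever thread the signal was delivered to, which by default is the thread that owns the QThread object, i.e. the main thread.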


First Solution

The proper way to use signals is with a worker object moved to the thread, as follows:

from PyQt5.QtCore import QThread, QObject, pyqtSignal

class GenericThread(QThread):
    def run(self, *args):
        # print("Current Thread: (GenericThread)", QThread.currentThread(), "\n")
        self.exec_()

class PickleDumpingWorker(QObject):
    pickleDumpingSignal = pyqtSignal(dict)

    def __init__(self):
        super().__init__()
        self.pickleDumpingSignal[dict].connect(self.savePickle)

    def savePickle(self, signal_dict):
        pickle.dump(signal_dict["deque"], open(file, "wb"))

pickleDumpingThread = GenericThread()
pickleDumpingThread.start()

pickleDumpingWorker = PickleDumpingWorker()
pickleDumpingWorker.moveToThread(pickleDumpingThread)

class Analyzer():
    def __init__(self):
        self.cnt = 0
        self.dataDeque = deque(maxlen=10000)
    def onData(self, data):
        self.dataDeque.append({
            "data": data, 
            "createdTime": time.time()
        })
        self.cnt += 1
        if self.cnt % 10000 == 0:
            pickleDumpingWorker.pickleDumpingSignal.emit({
                "action": "savePickle",
                "deque": self.dataDeque
            })
            # pickle.dump(dataDeque, open(file, 'wb'))

This solution worked (the pickle was dumped in a separate thread), but its drawback is that the data stream still stalls for about 0.5~1 seconds because of the signal emit() call.

I found that the best solution for my case is @PYPL's code, but it needs a few modifications to work.


Final Solution

Final solution is modifying @PYPL 's following code

thread = PickleDumpingThread(self.dataDeque)
thread.start() 

to

self.thread = PickleDumpingThread(self.dataDeque)
self.thread.start() 

The original code had a runtime error. It seems the thread was garbage collected before it dumped the pickle, because there was no reference to it after the onData() method finished.

Keeping a reference to the thread via self.thread solved this issue.

Also, it seems the old PickleDumpingThread is garbage collected after a new PickleDumpingThread is assigned to self.thread (because the old one loses its last reference).

However, I haven't verified this claim (as I don't know how to inspect the currently active threads).

In any case, the problem was solved by this change.


EDIT

My final solution has a delay too: it takes some amount of time to call Thread.start().

The real final solution I chose is running an infinite loop in the thread and monitoring some of that thread's variables to decide when to save the pickle. A bare infinite loop burns a lot of CPU, so I added time.sleep(0.1) to reduce the usage.
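A minimal sketch of this polling pattern, using threading.Thread instead of QThread so it runs standalone (the class name, the in-memory `dumped` list, and the 0.1 s interval are illustrative; real code would pickle.dump to a file inside the loop):

```python
import pickle
import threading
import time

class PollingDumper:
    """Background thread that polls for work instead of receiving signals."""

    def __init__(self):
        self.pending = None      # snapshot waiting to be dumped, or None
        self.dumped = []         # dumped blobs (real code would write files)
        self._stop = False
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # The monitored-variable loop described above.
        while not self._stop:
            if self.pending is not None:
                data, self.pending = self.pending, None
                self.dumped.append(pickle.dumps(data))
            time.sleep(0.1)      # avoid burning CPU in the bare loop

    def request_dump(self, data):
        # Called from the main thread; only takes a cheap snapshot.
        self.pending = list(data)

    def stop(self):
        self._stop = True
        self._thread.join()

dumper = PollingDumper()
dumper.request_dump(range(5))
while not dumper.dumped:         # wait briefly for the worker to pick it up
    time.sleep(0.05)
dumper.stop()
print(pickle.loads(dumper.dumped[0]))  # [0, 1, 2, 3, 4]
```

The same loop body could live in a QThread's run() method; the point is that the main thread only sets a variable, which is far cheaper than emitting a cross-thread signal.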


FINAL EDIT

OK... my 'real final solution' also had a delay. Even though I moved the dumping job to another QThread, the main thread was still blocked for roughly the pickle-dumping time! That was weird.

But I found the reason. It was neither emit() performance nor anything else I had suspected.

The reason was, embarrassingly, Python's Global Interpreter Lock (GIL), which prevents two threads in the same process from executing Python bytecode at the same time.

So probably I should use multiprocessing module in this case.

I'll post the result after modifying my code to use multiprocessing module.

Edit after using multiprocessing module and future attempts

Using multiprocessing module

Using the multiprocessing module solved the issue of running Python code concurrently, but a new essential problem arose: passing shared data between processes takes a considerable amount of time (in my case, passing the deque object to the child process took 1~2 seconds). I found that this cost cannot be avoided as long as I use the multiprocessing module, so I gave up on it.
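The cost comes from serialization: multiprocessing must pickle any object to move it between processes, so handing the deque to a child process pays roughly the same pickling cost that was being avoided. A small sketch, using a multiprocessing.Pipe between two threads of one process (an assumption made purely so the example runs without spawning a process; Connection.send() pickles its payload either way):

```python
import multiprocessing as mp
import pickle
import threading

# A multiprocessing Pipe pickles whatever passes through it, even within a
# single process -- this serialization is the unavoidable transfer cost.
parent_conn, child_conn = mp.Pipe()
data = list(range(100000))

received = []
t = threading.Thread(target=lambda: received.append(child_conn.recv()))
t.start()
parent_conn.send(data)   # send() pickles `data` before writing to the pipe
t.join()

print(received[0] == data)      # same content arrives...
print(received[0] is not data)  # ...but as a deserialized copy, not the object itself
```

So the 1~2 seconds spent "passing the deque" is essentially the pickling time itself, just moved into the hand-off.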

Possible future attempts

1. Doing only File I/O in QThread

The essential problem with pickle dumping is not writing to the file, but serializing before writing. Python releases the GIL while writing to a file, so disk I/O can run concurrently in a QThread. The problem is that serializing the deque object to bytes inside the pickle.dump method takes some amount of time, and during that moment the main thread is blocked because of the GIL.

Hence, the following approach should effectively shorten the delay:

  1. Every time onData() is called, serialize (stringify) just the new data object and push the result onto the deque.

  2. In PickleDumpingThread, just join the deque's elements to obtain the serialized form of the whole deque.

  3. file.write(stringified_deque_object). This can be done concurrently.

Step 1 takes very little time, so it almost never blocks the main thread. Step 2 may take some time, but obviously less than serializing the whole Python object inside pickle.dump. Step 3 doesn't block the main thread.
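The three steps above can be sketched as follows, assuming each record is pickled individually on arrival (the file name and helper names are illustrative, and the dump would run in the background thread):

```python
import pickle
import time
from collections import deque

buf = deque(maxlen=10000)

def on_data(data):
    # Step 1: serialize only the new record (cheap, barely holds the GIL).
    buf.append(pickle.dumps({"data": data, "createdTime": time.time()}))

def dump_to(path):
    # Step 2: join the pre-serialized chunks (faster than re-pickling all).
    blob = b"".join(buf)
    # Step 3: the file write releases the GIL, so it can overlap the main thread.
    with open(path, "wb") as f:
        f.write(blob)

for i in range(100):
    on_data(i)
dump_to("stream_dump.bin")
```

A convenient property of this layout: since every chunk is a complete pickle, the file can be read back by calling pickle.load() repeatedly on it until EOFError.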

2. Using C extension

We can manually release and reacquire the GIL in a custom C-extension module, but this might get messy.

3. Porting to Jython or IronPython

Jython and IronPython are alternative Python implementations built on Java and C#, respectively. They don't have a GIL, which means their threads really run in parallel. One problem is that PyQt is not supported on these implementations.

4. Porting to another language

..

Note:

  1. json.dump also took 1~2 seconds for my data.

  2. Cython is not an option in this case. Although Cython has with nogil:, only non-Python objects can be accessed in that block (the deque object cannot be), and the pickle.dump method cannot be called there.

When the GIL is the problem, a workaround is to subdivide the task into chunks in such a way that you can refresh the GUI between chunks.

E.g., say you have one huge list of size S to dump. You could define a class that derives from list and overrides __getstate__ to return N subpickle objects, each an instance of a class, say Subpickle, containing S/N items of your list. Each subpickle exists only while pickling, and defines __getstate__ to do 2 things:

  • call qApp.processEvents() on gui, and
  • return the sublist of S/N items.

While unpickling, each subpickle will refresh the GUI and collect its list of items; at the end, the total list is recreated in the original object from all the subpickles received in its __setstate__.

You should abstract out the call to process events in case you want to unpickle the pickle in a console app (or a non-PyQt GUI). You would do this by defining a class-wide attribute on Subpickle, say process_events, defaulting to None; if it is not None, __setstate__ calls it as a function. So by default there is no GUI refreshing between the subpickles, unless the app that unpickles sets this attribute to a callable before unpickling starts.

This strategy gives your GUI a chance to redraw during the unpickling process (and with only one thread, if you want).

Implementation depends on your exact data, but here is an example that demonstrates the principles for a large list:

import pickle

class SubList:
    on_pickling = None

    def __init__(self, sublist):
        print('SubList', sublist)
        self.data = sublist

    def __getstate__(self):
        if SubList.on_pickling is not None:
            print('SubList pickle state fetch: calling sub callback')
            SubList.on_pickling()
        return self.data

    def __setstate__(self, obj):
        if SubList.on_pickling is not None:
            print('SubList pickle state restore: calling sub callback')
            SubList.on_pickling()
        self.data = obj


class ListSubPickler:
    def __init__(self, data: list):
        self.data = data

    def __getstate__(self):
        print('creating SubLists for pickling long list')
        num_chunks = 10
        span = int(len(self.data) / num_chunks)
        SubLists = [SubList(self.data[i:(i + span)]) for i in range(0, len(self.data), span)]
        return SubLists

    def __setstate__(self, subpickles):
        self.data = []
        print('restoring Pickleable(list)')
        for subpickle in subpickles:
            self.data.extend(subpickle.data)
        print('final', self.data)

def refresh():
    # do something: refresh GUI (for example, qApp.processEvents() for Qt), show progress, etc
    print('refreshed')

data = list(range(100))  # your large data object
list_pickler = ListSubPickler(data)
SubList.on_pickling = refresh

print('\ndumping pickle of', list_pickler)
pickled = pickle.dumps(list_pickler)

print('\nloading from pickle')
new_list_pickler = pickle.loads(pickled)
assert new_list_pickler.data == data

print('\nloading from pickle, without on_pickling')
SubList.on_pickling = None
new_list_pickler = pickle.loads(pickled)
assert new_list_pickler.data == data

Easy to apply to dict, or even to make it adapt to the type of data it receives by using isinstance.
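For instance, the dict adaptation might be sketched like this (SubDict and DictSubPickler are illustrative names, following the same __getstate__/__setstate__ pattern as the list version above):

```python
import pickle

class SubDict:
    on_pickling = None   # set to a callable (e.g. qApp.processEvents) to refresh a GUI

    def __init__(self, subitems):
        self.data = dict(subitems)

    def __getstate__(self):
        if SubDict.on_pickling is not None:
            SubDict.on_pickling()
        return self.data

    def __setstate__(self, obj):
        if SubDict.on_pickling is not None:
            SubDict.on_pickling()
        self.data = obj

class DictSubPickler:
    def __init__(self, data: dict):
        self.data = data

    def __getstate__(self):
        # Split the items into chunks, exactly as in the list version.
        items = list(self.data.items())
        num_chunks = 10
        span = max(1, len(items) // num_chunks)
        return [SubDict(items[i:i + span]) for i in range(0, len(items), span)]

    def __setstate__(self, subpickles):
        # Merge the chunks back into one dict.
        self.data = {}
        for subpickle in subpickles:
            self.data.update(subpickle.data)

data = {i: str(i) for i in range(100)}
restored = pickle.loads(pickle.dumps(DictSubPickler(data)))
print(restored.data == data)  # True
```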
