
python; asyncore handle_read; do I need a separate thread?

From asyncore's documentation: https://docs.python.org/2/library/asyncore.html

import asyncore, socket

class HTTPClient(asyncore.dispatcher):

  def __init__(self, host, path):
      asyncore.dispatcher.__init__(self)
      self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
      self.connect( (host, 80) )
      self.buffer = 'GET %s HTTP/1.0\r\n\r\n' % path

  def handle_connect(self):
      pass

  def handle_close(self):
      self.close()

  def handle_read(self):
      print self.recv(8192)

  def writable(self):
      return (len(self.buffer) > 0)

  def handle_write(self):
      sent = self.send(self.buffer)
      self.buffer = self.buffer[sent:]

client = HTTPClient('www.python.org', '/')
asyncore.loop()

Now suppose instead we have:

def handle_read(self):
    data = self.recv(8192)
    # SOME REALLY LONG AND COMPLICATED THING

Is this handled in asyncore itself due to asyncore's polling/select methodology, or do I need to do:

def handle_read(self):
    data = self.recv(8192)
    h = Handler(data)
    h.start()

class Handler(threading.Thread):
    def __init__(self, data):
        threading.Thread.__init__(self)
        self.data = data
    def run(self):
        # LONG AND COMPLICATED THING WITH DATA

If I do need a thread, do I want h.join() after start()? It seems to work, but since join() blocks, I'm not exactly sure why.

Short answers

Is this handled in asyncore itself due to asyncore's polling/select methodology?

No, asyncore cannot handle a long blocking task in handle_read() by itself, since there is only one thread. While that thread is busy with the long job, nothing can interrupt it to service the socket.

However, such a blocking implementation can still make sense. The only issue is that the network transfer becomes slower. For example, if the long task takes 1 second, then the maximum data transfer rate is 8192 bytes per second. Although the data rate is lower, the network connection stays stable and works as expected; that is handled by the TCP implementation in the operating system kernel.
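That upper bound is simple arithmetic; a tiny sketch (the function name here is illustrative, not part of asyncore):

```python
def max_throughput(bytes_per_read, task_seconds):
    """Upper bound on the transfer rate when every recv() of
    bytes_per_read bytes is followed by a blocking task."""
    return bytes_per_read / task_seconds

print(max_throughput(8192, 1.0))  # 8192.0 bytes per second
```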

...or do I need to do...? If I do need a thread, do I want h.join() after start?

None of the above thread usages makes sense. However, it is still possible to use a helper thread to download data at the maximum rate and process that data in parallel; see below for explanations.


TCP protocol

TCP provides reliable, ordered, and error-checked delivery of a stream.

Data transfer:

Flow control — limits the rate a sender transfers data to guarantee reliable delivery. The receiver continually hints the sender on how much data can be received (controlled by the sliding window). When the receiving host's buffer fills, the next acknowledgment contains a 0 in the window size, to stop transfer and allow the data in the buffer to be processed.

...

When a receiver advertises a window size of 0, the sender stops sending data and starts the persist timer. The persist timer is used to protect TCP from a deadlock situation that could arise if a subsequent window size update from the receiver is lost, and the sender cannot send more data until receiving a new window size update from the receiver. When the persist timer expires, the TCP sender attempts recovery by sending a small packet so that the receiver responds by sending another acknowledgement containing the new window size.

So, when data is not read from the socket because of a long task in handle_read(), the socket buffer becomes full. The TCP connection is suspended and no new data packets are received. After the next recv() frees buffer space, a TCP ACK is sent to the sender to update the window size and the transfer resumes.

Similar behavior can be observed in file downloader applications when the transfer rate is limited in the settings. For example, if the limit is set to 1 KB/s the downloader may call recv(1000) once per second. Even if the physical network connection can carry 1 MB/s, only 1 KB/s will be received. In that case tcpdump or Wireshark will show TCP Zero Window packets and TCP Window Update packets.


Although the application will work with a long blocking task, the network connection is usually considered the bottleneck, so it may be better to release the network as soon as possible.

If the long task takes much longer than the data download, the simplest solution is to download everything first and only then process the downloaded data. However, that may not be acceptable if the download time is commensurate with the processing time. For example, 1 hour of download + 2 hours of processing can be done in 2 hours total if the processing is performed in parallel with the download.


Thread for each data block

If a new thread is created in handle_read() and the main thread does not wait for the helper thread to finish (no join()), the application may create a huge number of threads. Note that handle_read() may be called thousands of times per second, and if each long task takes more than a second the application may create hundreds of threads and finally be killed by an exception. Such a solution does not make sense, since there is no control over the number of threads, and the data blocks handled by those threads are also arbitrary: recv(8192) receives at most 8192 bytes, but it may return a smaller block.
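The partial-read behavior is easy to verify with a local socket pair (a minimal sketch, not part of the original answer):

```python
import socket

a, b = socket.socketpair()   # a connected pair of local sockets
b.sendall(b"hello")
chunk = a.recv(8192)         # asks for up to 8192 bytes...
print(len(chunk))            # ...but returns only what has arrived
a.close()
b.close()
```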


Thread for each data block and join with main thread

It does not make any sense to create a thread and immediately block the main thread with join(), since such a solution is no better than the initial solution without any thread.

A helper thread with a later join() can be used to do something in parallel. For example:

# Start detached thread
h.start()
# Do something in parallel to that thread
# ...
# Wait the thread to finish
h.join()

However, that is not the case here.


Persistent worker thread and producer-consumer data exchange

It is possible to create one persistent worker thread (or several, to use all CPU cores) that is responsible for data processing. It should be started before asyncore.loop(), for example:

handler = Handler()
handler.start()
asyncore.loop()

Now the handler thread can take all downloaded data for processing, and at the same time the main thread may continue with the download. While the handler thread is busy, the downloader appends data to the handler's data buffer. Proper synchronization between the threads is needed:

  • if downloaded data is being appended to the buffer, the handler thread should wait before it can access that buffer;
  • if the handler is reading the buffer, the downloader should wait before it can append to it;
  • if the handler has nothing to do and the buffer is empty, it should sleep and wait for new downloaded data.

That can be achieved using a threading Condition object, as in the producer-consumer example:

# create a new condition variable on __init__
cv = threading.Condition()

# Consume one item by Handler
cv.acquire()
while not an_item_is_available():
    cv.wait()
get_an_available_item()
cv.release()
# DO SOME REALLY LONG AND COMPLICATED THING

# Produce one item by Downloader
cv.acquire()
make_an_item_available()
cv.notify()
cv.release()

Here make_an_item_available() may correspond to appending downloaded data to the buffer and/or setting some other shared state variables (for example in handle_close()). The handler thread should do its long task after cv.release(), so that during the long task the downloader is able to acquire the lock and append new data to the buffer.
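The same producer-consumer exchange can also be written with queue.Queue, which handles the locking and waiting internally; a sketch with illustrative names, using a None sentinel for shutdown:

```python
import queue
import threading

buf = queue.Queue()
processed = []

def handler():
    while True:
        data = buf.get()             # blocks while the queue is empty
        if data is None:             # sentinel: downloader has finished
            break
        processed.append(len(data))  # stand-in for the long processing task

t = threading.Thread(target=handler)
t.start()
buf.put(b"downloaded block")         # producer side, e.g. from handle_read()
buf.put(None)                        # signal shutdown, e.g. from handle_close()
t.join()
print(processed)                     # [16]
```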

This is along the same lines as a question I had previously asked here .

If you have a LONG AND COMPLICATED THING WITH DATA that you need to achieve, executing it within the event loop will block the event loop from doing anything else until your task has completed.

The same is true if you spawn a thread and then join() it (join() simply blocks execution until the joined thread is finished); however, if you spawn a worker thread and let it run to completion on its own, then the event loop is free to continue processing while your long task completes in parallel.
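A minimal timing sketch of that difference, with a 0.2-second sleep standing in for the long task:

```python
import threading
import time

def long_task():
    time.sleep(0.2)  # stand-in for the LONG AND COMPLICATED THING

# start + join: the caller is blocked for the whole task
t = threading.Thread(target=long_task)
start = time.time()
t.start()
t.join()
blocked = time.time() - start

# start only: the caller returns immediately and can keep looping
t = threading.Thread(target=long_task)
start = time.time()
t.start()
free = time.time() - start
t.join()  # only to clean up before exiting

print(blocked >= 0.2, free < 0.2)  # True True
```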

I am posting my own answer because it was inspired by Orest Hera's answer, but since I know my workload, it is a slight variant.

My workload is such that requests can arrive in bursts, but these bursts are sporadic (non-stationary). Moreover, they need to be processed in the order they are received. So, here is what I did:

#! /usr/bin/env python3

import asyncore #https://docs.python.org/2/library/asyncore.html
import socket
import threading    
import queue
import time

fqueue = queue.Queue()

class Handler(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        self.keep_reading = True

    def run(self):
        while self.keep_reading:
            if fqueue.empty():
                time.sleep(1)
            else:
                data = fqueue.get()
                # PROCESS data
    def stop(self):
        self.keep_reading = False


class Listener(asyncore.dispatcher): #http://effbot.org/librarybook/asyncore.htm
    def __init__(self, host, port):
        asyncore.dispatcher.__init__(self)
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.connect((host, port))


    def handle_read(self):
        data = self.recv(40) #pretend it always waits for 40 bytes
        fqueue.put(data)

    def start(self):
        try:
            h = Handler()
            h.start()
            asyncore.loop()
        except KeyboardInterrupt:
            pass
        finally:
            h.stop() 
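One possible refinement of the Handler above (a sketch under the same queue-based design): a blocking get() with a timeout avoids the one-second polling sleep, so a block is processed as soon as it arrives, while stop() is still honored within about a second.

```python
import queue
import threading

fqueue = queue.Queue()

class Handler(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        self.keep_reading = True

    def run(self):
        while self.keep_reading:
            try:
                data = fqueue.get(timeout=1)  # wakes as soon as data arrives
            except queue.Empty:
                continue                      # periodically re-check the flag
            print(len(data))                  # PROCESS data here
            fqueue.task_done()

    def stop(self):
        self.keep_reading = False

h = Handler()
h.start()
fqueue.put(b"0123456789" * 4)  # one 40-byte block, as in handle_read()
fqueue.join()                  # wait until the block has been processed
h.stop()
h.join()
```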
