
Python performance - best parallelism approach

I am implementing a Python script that needs to keep sending 1500+ packets in parallel, each within less than 5 seconds.

In a nutshell what I need is:

def send_pkts(ip):
    #craft packet
    while True:
        #send packet
        time.sleep(randint(0,3))

for x in list[:1500]:
    send_pkts(x)
    time.sleep(randint(1,5))

I have tried the simple single-threaded, multithreading, multiprocessing and multiprocessing+multithreading forms and had the following issues:

  1. Simple single-threaded: the "for" delay seems to compromise the "5 seconds" requirement.
  2. Multithreading: I think I could not accomplish what I desire due to Python GIL limitations.
  3. Multiprocessing: this was the approach that seemed to work best. However, due to the excessive number of processes, the VM where I am running the script freezes (naturally, with 1500 processes running), making it impractical.
  4. Multiprocessing + multithreading: in this approach I created fewer processes, each of them spawning several threads (say, 10 processes with 150 threads each). The VM clearly does not freeze as fast as in approach 3, but the most "concurrent packet sending" I could reach was ~800. GIL limitations? VM limitations? In this attempt I also tried using a process pool, but the results were similar.

Is there a better approach I could use to accomplish this task?

[1] EDIT 1:

 def send_pkt(x):
     #craft pkt
     while True:
         #send pkt
         gevent.sleep(0)

 gevent.joinall([gevent.spawn(send_pkt, x) for x in list[:1500]])

[2] EDIT 2 (gevent monkey-patching):

from gevent import monkey; monkey.patch_all()

jobs = [gevent.spawn(send_pkt, x) for x in list[:1500]]
gevent.wait(jobs)
#for send_pkt(x) check [1]

However I got the following error: "ValueError: filedescriptor out of range in select()". So I checked my system ulimit (soft and hard limits are both at the maximum: 65536). Then I found it has to do with select() limitations on Linux (a maximum of 1024 fds). Please check: http://man7.org/linux/man-pages/man2/select.2.html (BUGS section). To overcome that I should use poll() ( http://man7.org/linux/man-pages/man2/poll.2.html ) instead, but with poll() I return to the same limitations, as polling is a "blocking approach".
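As a side note (not from the original post): Python's standard `selectors` module side-steps the select() fd limit by choosing epoll (Linux) or kqueue (BSD/macOS) when available, so readiness polling itself need not hit the 1024-fd FD_SETSIZE cap. A minimal sketch:

```python
import selectors
import socket

# DefaultSelector picks the best mechanism for the platform: on Linux
# this is EpollSelector, which has no FD_SETSIZE cap, so fd numbers
# above 1024 are fine.
sel = selectors.DefaultSelector()
print(type(sel).__name__)

# Registering a socket works the same regardless of the backend.
sock = socket.socket()
sock.setblocking(False)
sel.register(sock, selectors.EVENT_WRITE)
sel.unregister(sock)
sock.close()
sel.close()
```

Event-loop libraries like gevent normally use epoll via libev as well; the select() error above typically comes from a code path still using the select() syscall directly.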

Regards,

When using parallelism in Python, a good approach is to use either ThreadPoolExecutor or ProcessPoolExecutor from https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures ; these work well in my experience.

Here is an example using ThreadPoolExecutor that can be adapted for your use.

import concurrent.futures
import time

IPs = ['168.212.226.204',
       '168.212.226.204',
       '168.212.226.204',
       '168.212.226.204',
       '168.212.226.204']

def send_pkt(ip):
    status = 'Failed'
    while True:
        #send pkt
        time.sleep(10)
        status = 'Successful'
        break
    return status

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    future_to_ip = {executor.submit(send_pkt, ip): ip for ip in IPs}
    for future in concurrent.futures.as_completed(future_to_ip):
        ip = future_to_ip[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (ip, exc))
        else:
            print('%r send %s' % (ip, data))

Your result in option 3: "due to excessive quantity of process the VM where I am running the script freezes (of course, 1500 process running)" could bear further investigation. I believe it may be underdetermined from the information gathered so far whether this is better characterized as a shortcoming of the multiprocessing approach, or a limitation of the VM.

One fairly simple and straightforward approach would be to run a scaling experiment: rather than having all sends happen from individual processes or all from a single process, try intermediate values. Time how long it takes to split the workload between two processes, then 4, 8, and so on.

While doing that, it may also be a good idea to run a tool like xperf on Windows or oprofile on Linux to record whether these different choices of parallelism lead to different kinds of bottlenecks, for example thrashing the CPU cache or running the VM out of memory. The easiest way to tell is to try it and measure.

Based on prior experience with these types of problems and general rules of thumb, I would expect the best performance to come when the number of multiprocessing processes is less than or equal to the number of available CPU cores (either on the VM itself or on the hypervisor). That is however assuming that the problem is CPU bound; it's possible performance would still be higher with more than #cpu processes if something blocks during packet sending that would allow better use of CPU time if interleaved with other blocking operations. Again though, we don't know until some profiling and/or scaling experiments are done.
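Such a scaling experiment might look like the following sketch, where `fake_send` is a hypothetical CPU-bound stand-in for the real send_pkts work:

```python
import multiprocessing as mp
import time

def fake_send(ip):
    # hypothetical stand-in for send_pkts: burn some CPU
    return sum(i * i for i in range(10000))

def timed_run(n_workers, n_tasks=200):
    # split n_tasks across n_workers processes and time the whole batch
    ips = ['10.0.0.%d' % (i % 250) for i in range(n_tasks)]
    start = time.time()
    with mp.Pool(n_workers) as pool:
        pool.map(fake_send, ips)
    return time.time() - start

if __name__ == '__main__':
    for n in (1, 2, 4, 8):
        print('%d workers: %.2fs' % (n, timed_run(n)))
```

Plotting elapsed time against worker count should make it obvious where adding processes stops helping on your VM.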

You are correct that CPython effectively runs Python bytecode on one thread at a time (the GIL); however, your desired task (sending network packets) is an I/O-bound operation, and therefore a good candidate for multithreading. Your main thread is not busy while the packets are transmitting, as long as you write your code with async in mind.

Take a look at the python docs on async tcp networking - https://docs.python.org/3/library/asyncio-protocol.html#tcp-echo-client .
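A minimal self-contained sketch of that asyncio approach (the toy server here exists only so the example runs end to end; in practice you would connect to your real target, and host/port/payload size are placeholders):

```python
import asyncio
import os

received = []

async def handle(reader, writer):
    # toy server: record each incoming payload
    received.append(await reader.read(64))
    writer.close()

async def send_pkt(host, port, payload):
    # one connection per packet; all of these run concurrently
    _, writer = await asyncio.open_connection(host, port)
    writer.write(payload)
    await writer.drain()
    writer.close()

async def main(n=100):
    server = await asyncio.start_server(handle, '127.0.0.1', 0)
    port = server.sockets[0].getsockname()[1]
    # n concurrent sends on a single thread, no GIL contention
    await asyncio.gather(*(send_pkt('127.0.0.1', port, os.urandom(64))
                           for _ in range(n)))
    await asyncio.sleep(0.5)  # let the server finish its handlers
    server.close()
    await server.wait_closed()
    return len(received)

if __name__ == '__main__':
    print(asyncio.run(main(100)))
```

Since everything runs on one thread, there is no per-connection process or thread overhead, which is what makes 1500 concurrent sends feasible.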

If the bottleneck is I/O based ("sending packets"), then the GIL actually shouldn't be too much of a problem.

If there is computation happening within python as well, then the GIL may get in the way and, as you say, process-based parallelism would be preferred.
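A toy illustration of that distinction (time.sleep stands in for a blocking network wait, which releases the GIL):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def io_task(_):
    time.sleep(0.2)  # stand-in for waiting on the network; releases the GIL

start = time.time()
with ThreadPoolExecutor(max_workers=10) as ex:
    list(ex.map(io_task, range(10)))
print('elapsed: %.1fs' % (time.time() - start))  # ~0.2s, not 2s: waits overlap
```

Replace the sleep with pure-Python computation and the ten threads would take roughly as long as running the tasks serially, which is when process-based parallelism pays off.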

You do not need one process per task! This seems to be the oversight in your thinking. With Python's Pool class, you can easily create a set of workers which will receive tasks from a queue.


import multiprocessing


def send_pkts(ip):
   ...


number_of_workers = 8

with multiprocessing.Pool(number_of_workers) as pool:
    pool.map(send_pkts, list[:1500])

You are now running number_of_workers + 1 processes (the workers plus the original process), and the N workers are running the send_pkts function concurrently.

The main issue keeping you from achieving your desired performance is the send_pkts() method. It doesn't just send the packet, it also crafts the packet:

def send_pkts(ip):
    #craft packet
    while True:
        #send packet
        time.sleep(randint(0,3))

While sending a packet is almost certainly an I/O bound task, crafting a packet is almost certainly a CPU bound task. This method needs to be split into two tasks:

  1. craft a packet
  2. send a packet

I've written a basic socket server and a client app that crafts and sends packets to the server. The idea is to have a separate process which crafts the packets and puts them into a queue. There is a pool of threads that share the queue with the packet-crafting process. These threads pull packets off of the queue and send them to the server. They also stick the server's responses into another shared queue, but that's just for my own testing and not relevant to what you're trying to do. The threads exit when they get a None (a poison pill) from the queue.

server.py:

import argparse
import socketserver
import time


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--host", type=str, help="bind to host")
    parser.add_argument("--port", type=int, help="bind to port")
    parser.add_argument("--packet-size", type=int, help="size of packets")
    args = parser.parse_args()
    HOST, PORT = args.host, args.port

    class MyTCPHandler(socketserver.BaseRequestHandler):
        def handle(self):
            time.sleep(1.5)
            data = self.request.recv(args.packet_size)
            self.request.sendall(data.upper())

    with socketserver.ThreadingTCPServer((HOST, PORT), MyTCPHandler) as server:
        server.serve_forever()

client.py:

import argparse
import logging
import multiprocessing as mp
import os
import queue as q
import socket
import time
from threading import Thread


def get_logger():
    logger = logging.getLogger("threading_example")
    logger.setLevel(logging.INFO)

    fh = logging.FileHandler("client.log")
    fmt = '%(asctime)s - %(threadName)s - %(levelname)s - %(message)s'
    formatter = logging.Formatter(fmt)
    fh.setFormatter(formatter)

    logger.addHandler(fh)
    return logger


class PacketMaker(mp.Process):
    def __init__(self, result_queue, max_packets, packet_size, num_poison_pills, logger):
        mp.Process.__init__(self)
        self.result_queue = result_queue
        self.max_packets = max_packets
        self.packet_size = packet_size
        self.num_poison_pills = num_poison_pills
        self.num_packets_made = 0
        self.logger = logger

    def run(self):
        while True:
            if self.num_packets_made >= self.max_packets:
                for _ in range(self.num_poison_pills):
                    self.result_queue.put(None, timeout=1)
                self.logger.debug('PacketMaker exiting')
                return
            self.result_queue.put(os.urandom(self.packet_size), timeout=1)
            self.num_packets_made += 1


class PacketSender(Thread):
    def __init__(self, task_queue, result_queue, addr, packet_size, logger):
        Thread.__init__(self)
        self.task_queue = task_queue
        self.result_queue = result_queue
        self.server_addr = addr
        self.packet_size = packet_size
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.sock.connect(addr)
        self.logger = logger

    def run(self):
        while True:
            packet = self.task_queue.get(timeout=1)
            if packet is None:
                self.logger.debug("PacketSender exiting")
                return
            try:
                self.sock.sendall(packet)
                response = self.sock.recv(self.packet_size)
            except socket.error:
                self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
                self.sock.connect(self.server_addr)
                self.sock.sendall(packet)
                response = self.sock.recv(self.packet_size)
            self.result_queue.put(response)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--num-packets', type=int, help='number of packets to send')
    parser.add_argument('--packet-size', type=int, help='packet size in bytes')
    parser.add_argument('--num-threads', type=int, help='number of threads sending packets')
    parser.add_argument('--host', type=str, help='name of host packets will be sent to')
    parser.add_argument('--port', type=int, help='port number of host packets will be sent to')
    args = parser.parse_args()

    logger = get_logger()
    logger.info(f"starting script with args {args}")
    
    packets_to_send = mp.Queue(args.num_packets + args.num_threads)
    packets_received = q.Queue(args.num_packets)
    producers = [PacketMaker(packets_to_send, args.num_packets, args.packet_size, args.num_threads, logger)]
    senders = [PacketSender(packets_to_send, packets_received, (args.host, args.port), args.packet_size, logger)
               for _ in range(args.num_threads)]
    start_time = time.time()
    logger.info("starting workers")
    for worker in senders + producers:
        worker.start()
    for worker in senders:
        worker.join()
    logger.info("workers finished")
    end_time = time.time()
    print(f"{packets_received.qsize()} packets received in {end_time - start_time} seconds")

run.sh:

#!/usr/bin/env bash

for i in "$@"
do
case $i in
    -s=*|--packet-size=*)
    packet_size="${i#*=}"
    shift 
    ;;
    -n=*|--num-packets=*)
    num_packets="${i#*=}"
    shift 
    ;;
    -t=*|--num-threads=*)
    num_threads="${i#*=}"
    shift 
    ;;
    -h=*|--host=*)
    host="${i#*=}"
    shift 
    ;;
    -p=*|--port=*)
    port="${i#*=}"
    shift 
    ;;
    *)
    ;;
esac
done

python3 server.py --host="${host}" \
                  --port="${port}" \
                  --packet-size="${packet_size}" &
server_pid=$!
python3 client.py --packet-size="${packet_size}" \
                  --num-packets="${num_packets}" \
                  --num-threads="${num_threads}" \
                  --host="${host}" \
                  --port="${port}"
kill "${server_pid}"

$ ./run.sh -s=1024 -n=1500 -t=300 -h=localhost -p=9999

1500 packets received in 4.70330023765564 seconds

$ ./run.sh -s=1024 -n=1500 -t=1500 -h=localhost -p=9999

1500 packets received in 1.5025699138641357 seconds

This result may be verified by changing the log level in client.py to DEBUG . Note that the script does take much longer than 4.7 seconds to complete. There is quite a lot of teardown required when using 300 threads, but the log makes it clear that the threads are done processing at 4.7 seconds.

Take all performance results with a grain of salt. I have no clue what system you're running this on. For reference, my relevant system stats: 2x Xeon X5550 @ 2.67GHz, 24GB DDR3 @ 1333MHz, Debian 10, Python 3.7.3.


I'll address the issues with your attempts:

  1. Simple single-threaded: this is all but guaranteed to take at least 1.5 x num_packets seconds due to the randint(0, 3) delay (average 1.5 s per packet).
  2. Multithreading: the GIL is the likely bottleneck here, but it's likely the craft packet part rather than send packet.
  3. Multiprocessing: each process requires at least one file descriptor, so you're probably exceeding a user or system limit, but this could work if you change the appropriate settings.
  4. Multiprocessing + multithreading: this fails for the same reason as #2; crafting the packet is probably CPU bound.

The rule of thumb is: I/O bound - use threads, CPU bound - use processes
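That rule of thumb can be combined into one pipeline, as a compact sketch (not the full client above; `craft` and `send` are hypothetical stand-ins): craft packets in a process pool, then hand them to a thread pool for sending.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import os

def craft(size):
    # CPU-bound stand-in: build a payload of the given size
    return os.urandom(size)

def send(payload):
    # I/O-bound stand-in for a real socket send; returns bytes "sent"
    return len(payload)

if __name__ == '__main__':
    # processes for the CPU-bound crafting, threads for the I/O-bound sending
    with ProcessPoolExecutor() as crafters:
        packets = list(crafters.map(craft, [1024] * 100))
    with ThreadPoolExecutor(max_workers=50) as senders:
        sent = list(senders.map(send, packets))
    print(sum(sent))  # 102400
```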
