Python multithreading/multiprocessing very slow with concurrent.futures

I am trying to use multithreading and/or multiprocessing to speed up my script. Essentially, I have a list of 10,000 subnets read in from a CSV file, and I want to convert each one into an IPv4 network object and store the results in a list.

My base code is as follows and executes in roughly 300 ms:

import ipaddress

# acls: list of dicts read from the CSV (loading code not shown)
aclsConverted = []
def convertToIP(ip):
    aclsConverted.append(ipaddress.ip_network(ip))

for y in acls:
    convertToIP(y['srcSubnet'])

If I try with concurrent.futures threads, it works but is 3-4x slower:

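import concurrent.futures
import ipaddress

# acls: list of dicts read from the CSV (loading code not shown)
aclsConverted = []
def convertToIP(ip):
    aclsConverted.append(ipaddress.ip_network(ip))

with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    for y in acls:
        executor.submit(convertToIP, y['srcSubnet'])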

Then, if I try with concurrent.futures processes, it is 10-15x slower and the list ends up empty. The code is as follows:

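import concurrent.futures
import ipaddress

# acls: list of dicts read from the CSV (loading code not shown)
aclsConverted = []
def convertToIP(ip):
    aclsConverted.append(ipaddress.ip_network(ip))

with concurrent.futures.ProcessPoolExecutor(max_workers=20) as executor:
    for y in acls:
        executor.submit(convertToIP, y['srcSubnet'])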

The server I am running this on has 28 physical cores.

Any suggestions as to what I might be doing wrong will be gratefully received!

If the tasks are too small, the overhead of managing multiprocessing or multithreading often costs more than running the tasks in parallel gains.
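One way to check this on your own data is to time the plain loop against a thread pool. The sketch below is only an illustration: the subnet list is synthetic (the original CSV is not shown), and `make_subnets`, `convert_plain`, `convert_threaded` and `timed` are hypothetical helper names. Actual timings depend on the machine and the Python version.

```python
import concurrent.futures
import ipaddress
import time


def make_subnets(count):
    # Synthetic stand-in for the CSV data: 10.a.b.0/24 networks.
    return [f"10.{(i >> 8) & 255}.{i & 255}.0/24" for i in range(count)]


def convert_plain(subnets):
    # Plain loop: one small call per subnet, no scheduling overhead.
    return [ipaddress.ip_network(s) for s in subnets]


def convert_threaded(subnets, workers=20):
    # Same work, but every subnet becomes a separate task in a thread pool.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(ipaddress.ip_network, subnets))


def timed(func, *args):
    # Return (result, elapsed seconds) for a single call.
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    subnets = make_subnets(10_000)
    _, plain_s = timed(convert_plain, subnets)
    _, threaded_s = timed(convert_threaded, subnets)
    print(f"plain loop:  {plain_s:.3f} s")
    print(f"thread pool: {threaded_s:.3f} s")
```

Both functions return the same list, so any difference in the printed times comes from task management rather than from the conversion itself.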

You might try the following:

Create just two processes (not threads!), one handling the first 5,000 subnets and the other handling the remaining 5,000.

You might see some performance improvement there, but the tasks you perform are not very CPU- or I/O-intensive, so I am not sure it will help.
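The two-process idea could look roughly like this. It is a sketch under assumptions: the input is a plain list of subnet strings (the CSV reading is not shown in the question), and `convert_chunk`, `split_in_two` and `convert_in_two_processes` are hypothetical names. Each worker returns its converted chunk instead of appending to a global list, because a list in the parent process is not shared with the worker processes.

```python
import concurrent.futures
import ipaddress


def convert_chunk(subnets):
    # Runs inside a worker process: convert a whole chunk in one task.
    return [ipaddress.ip_network(s) for s in subnets]


def split_in_two(items):
    # Split a list into two halves of (almost) equal size.
    middle = len(items) // 2
    return [items[:middle], items[middle:]]


def convert_in_two_processes(subnets):
    # Only two tasks in total, so per-task overhead is paid twice,
    # not once per subnet.
    converted = []
    with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
        for chunk_result in executor.map(convert_chunk, split_in_two(subnets)):
            converted.extend(chunk_result)
    return converted


if __name__ == "__main__":
    # Synthetic input; replace with the subnets read from the CSV.
    subnets = [f"10.0.{i}.0/24" for i in range(256)]
    converted = convert_in_two_processes(subnets)
    print(len(converted), converted[0], converted[-1])
```

Whether this beats the plain loop still has to be measured: the results are sent back to the parent process, and for a job of about 300 ms that cost may cancel the gain.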

Multithreading in Python, on the other hand, brings no performance improvement at all for tasks that do no I/O and are pure Python code.

The reason is the infamous GIL (global interpreter lock). In CPython, the GIL prevents two threads from executing Python bytecode in parallel within the same process.

Multithreading in Python still makes sense for tasks that do I/O (such as network access), that sleep, or that call modules implemented in C that release the GIL. numpy, for example, releases the GIL and is therefore a good candidate for multithreading.
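A small experiment shows the case where threads do help. In this sketch, `time.sleep` stands in for a network call (it releases the GIL while waiting), and `fake_io`, `run_sequential` and `run_threaded` are hypothetical names chosen for illustration.

```python
import concurrent.futures
import time


def fake_io(delay):
    # Stand-in for a network call: time.sleep releases the GIL while waiting.
    time.sleep(delay)
    return delay


def run_sequential(count, delay):
    # The waits add up: roughly count * delay seconds in total.
    start = time.perf_counter()
    for _ in range(count):
        fake_io(delay)
    return time.perf_counter() - start


def run_threaded(count, delay):
    # The waits overlap: with one thread per task, roughly one delay in total.
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=count) as executor:
        list(executor.map(fake_io, [delay] * count))
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"sequential: {run_sequential(8, 0.1):.2f} s")
    print(f"threaded:   {run_threaded(8, 0.1):.2f} s")
```

Converting a string with `ipaddress.ip_network` involves no such waiting, which is why the same pattern does not pay off for the code in the question.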

The threading module is meant for I/O-bound operations; it cannot in any way accelerate the conversion of strings to IP address instances. When you create threads, you spend time managing them but get nothing in return, which is why the program without threads is faster.

The multiprocessing module is meant for heavy computations, such as finding the sum of the squares of all numbers in the range from 0 to 1,000,000,000. Your operation is not that heavy. Multiprocessing needs to serialize data to send it from one process to another, which is expensive, and spawning processes is expensive as well. As a result, it is even slower than threading in this case.
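The serialization step can be illustrated without starting any processes. `concurrent.futures.ProcessPoolExecutor` requires arguments and results to be picklable, and the receiving side always works on a copy. The sketch below mimics that round trip inside a single process; `send_like_a_process` is a hypothetical helper, and this is only an analogy for why a list appended to in a worker process stays empty in the parent.

```python
import ipaddress
import pickle


def send_like_a_process(obj):
    # Serialize and deserialize, as happens when data crosses a process boundary.
    return pickle.loads(pickle.dumps(obj))


if __name__ == "__main__":
    parent_list = []
    # The "worker" receives a copy, not the parent's list itself.
    worker_copy = send_like_a_process(parent_list)
    worker_copy.append(ipaddress.ip_network("10.0.0.0/24"))
    print(len(parent_list), len(worker_copy))  # prints: 0 1
```

Every subnet string sent to a worker, and every network object sent back, pays for such a round trip, on top of the cost of starting the processes.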

I guess the only way to speed up the program is to read the CSV faster with something like pandas; you can then convert the strings to IP objects inside the pandas DataFrame with the apply method.
