
os.sched_getaffinity(0) vs os.cpu_count()

So, I know the difference between the two methods in the title, but not the practical implications.

From what I understand: if you use more workers (NUM_WORKERS) than there are cores actually available, you face big performance drops because your OS constantly switches back and forth trying to keep things running in parallel. Don't know how true this is, but I read it here on SO somewhere from someone smarter than me.

And in the docs for os.cpu_count() it says:

Return the number of CPUs in the system. Returns None if undetermined. This number is not equivalent to the number of CPUs the current process can use. The number of usable CPUs can be obtained with len(os.sched_getaffinity(0)).

So, I'm trying to work out what the "system" refers to if a process can be restricted to fewer CPUs than there are in the "system".
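To make the distinction concrete, here is a minimal sketch, assuming Linux (where os.sched_getaffinity() is available). If the interpreter is started with its affinity restricted, e.g. via taskset, the two numbers diverge:

import os

# Logical CPUs known to the OS, regardless of any restriction on this process.
print(os.cpu_count())                    # e.g. 8

# CPUs the current process may actually run on (Linux-only call).
# e.g. 2 when started with: taskset -c 0,1 python this_script.py
print(len(os.sched_getaffinity(0)))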

I just want to safely and efficiently implement multiprocessing.Pool functionality. So here is my question, summarized:

What are the practical implications of:

NUM_WORKERS = os.cpu_count() - 1
# vs.
NUM_WORKERS = len(os.sched_getaffinity(0)) - 1

The -1 is because I've found that my system is a lot less laggy if I try to work while data is being processed.

These two functions are very different, and NUM_WORKERS = os.sched_getaffinity(0) - 1 would just fail instantly with a TypeError, because you are trying to subtract an integer from a set. While os.cpu_count() tells you how many cores the system has, os.sched_getaffinity(pid) tells you on which cores a certain thread/process is allowed to run.
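Note that os.sched_getaffinity() is not available on every platform (the docs list it for Linux), so a portable way to compute a worker count along the lines of the question might look like this sketch:

import os

# Prefer the number of cores this process may actually use;
# fall back to os.cpu_count() where sched_getaffinity doesn't exist.
try:
    num_usable = len(os.sched_getaffinity(0))
except AttributeError:
    num_usable = os.cpu_count() or 1

NUM_WORKERS = max(1, num_usable - 1)  # leave one core free, as in the question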


os.cpu_count()

os.cpu_count() shows the number of available cores as known to the OS (virtual cores). Most likely you have half this number of physical cores. Whether it makes sense to use more processes than you have physical cores, or even more than virtual cores, depends very much on what you are doing. The tighter the computational loop (little diversity in instructions, few cache misses, ...), the less likely you are to benefit from using more cores (via more worker-processes), and the more likely you are to experience performance degradation.
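Whether the virtual count really is twice the physical count depends on the machine (hyper-threading/SMT). One way to compare the two, assuming the third-party psutil package (not part of the standard library):

import os
import psutil  # third-party: pip install psutil

print(os.cpu_count())                   # logical (virtual) cores, e.g. 8
print(psutil.cpu_count(logical=False))  # physical cores, e.g. 4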

Obviously it also depends on what else your system is running, because your system tries to give every thread (as the actual execution unit of a process) a fair share of run-time on the available cores. So there is no generalization possible in terms of how many workers you should use. But if, for instance, you have a tight loop and your system is idling, a good starting point for optimizing is

os.cpu_count() // 2 # same as mp.cpu_count() // 2 

...and increasing from there.
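A minimal sketch of that starting point (square is a hypothetical stand-in for whatever CPU-bound task you actually run):

import multiprocessing as mp


def square(x):  # hypothetical stand-in for a CPU-bound task
    return x * x


if __name__ == '__main__':
    # Start with half the virtual cores and tune from there.
    with mp.Pool(mp.cpu_count() // 2) as pool:
        print(pool.map(square, range(10)))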

As @Frank Yellin already mentioned, multiprocessing.Pool uses os.cpu_count() as the default number of workers.

os.sched_getaffinity(pid)

Return the set of CPUs the process with PID pid (or the current process if zero) is restricted to.

Now core/CPU/processor affinity is about which concrete (virtual) cores your thread (within your worker-process) is allowed to run on. Your OS gives every core an id, from 0 to (number-of-cores - 1), and changing affinity allows restricting ("pinning") the actual core(s) on which a certain thread is allowed to run at all.

At least on Linux I found this to mean that if none of the allowed cores is currently available, the thread of a child-process won't run, even if other, non-allowed cores would be idle. So "affinity" is a bit misleading here.

The goal when fiddling with affinity is to minimize cache invalidations from context-switches and core-migrations. Your OS here usually has the better insight and already tries to keep caches "hot" with its scheduling-policy, so unless you know what you're doing, you can't expect easy gains from interfering.

By default the affinity is set to all cores, and for multiprocessing.Pool it doesn't make too much sense to bother with changing that, at least if your system is otherwise idle.

Note that despite the fact that the docs here speak of "process", setting affinity really is a per-thread thing. So, for example, setting affinity in a "child" thread for the "current process if zero" does not change the affinity of the main-thread or of other threads within the process. But child-threads inherit their affinity from the main-thread, and child-processes (through their main-thread) inherit affinity from the parent process's main-thread. This affects all possible start-methods ("spawn", "fork", "forkserver"). The example below demonstrates this, and how to modify affinity when using multiprocessing.Pool.

import multiprocessing as mp
import threading
import os


def _location():
    return f"{mp.current_process().name} {threading.current_thread().name}"


def thread_foo():
    print(f"{_location()}, affinity before change: {os.sched_getaffinity(0)}")
    os.sched_setaffinity(0, {4})
    print(f"{_location()}, affinity after change: {os.sched_getaffinity(0)}")


def foo(_, iterations=200e6):

    print(f"{_location()}, affinity before thread_foo:"
          f" {os.sched_getaffinity(0)}")

    for _ in range(int(iterations)):  # some dummy computation
        pass

    t = threading.Thread(target=thread_foo)
    t.start()
    t.join()

    print(f"{_location()}, affinity before exit is unchanged: "
          f"{os.sched_getaffinity(0)}")

    return _


if __name__ == '__main__':

    mp.set_start_method("spawn")  # alternatives on Unix: "fork", "forkserver"

    # for current process, exclude cores 0,1 from affinity-mask
    print(f"parent affinity before change: {os.sched_getaffinity(0)}")
    excluded_cores = {0, 1}
    os.sched_setaffinity(0, os.sched_getaffinity(0).difference(excluded_cores))
    print(f"parent affinity after change: {os.sched_getaffinity(0)}")

    with mp.Pool(2) as pool:
        pool.map(foo, range(5))

Output:

parent affinity before change: {0, 1, 2, 3, 4, 5, 6, 7}
parent affinity after change: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-1 MainThread, affinity before thread_foo: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-2 MainThread, affinity before thread_foo: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-1 Thread-1, affinity before change: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-1 Thread-1, affinity after change: {4}
SpawnPoolWorker-1 MainThread, affinity before exit is unchanged: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-1 MainThread, affinity before thread_foo: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-2 Thread-1, affinity before change: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-2 Thread-1, affinity after change: {4}
SpawnPoolWorker-2 MainThread, affinity before exit is unchanged: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-2 MainThread, affinity before thread_foo: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-2 Thread-2, affinity before change: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-2 Thread-2, affinity after change: {4}
SpawnPoolWorker-2 MainThread, affinity before exit is unchanged: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-2 MainThread, affinity before thread_foo: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-1 Thread-2, affinity before change: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-1 Thread-2, affinity after change: {4}
SpawnPoolWorker-1 MainThread, affinity before exit is unchanged: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-2 Thread-3, affinity before change: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-2 Thread-3, affinity after change: {4}
SpawnPoolWorker-2 MainThread, affinity before exit is unchanged: {2, 3, 4, 5, 6, 7}

If you had tasks that were pure 100% CPU-bound, i.e. did nothing but calculations, then clearly nothing would/could be gained by having a process pool size greater than the number of CPUs available on your computer. But what if there was a mix of I/O thrown in, whereby a process would relinquish the CPU while waiting for an I/O operation to complete (or, for example, for a URL to be returned from a website, which takes a relatively long time)? To me it's not clear that you couldn't achieve improved throughput in this scenario with a process pool size that exceeds os.cpu_count().

Update

Here is code to demonstrate the point. This code, which would probably be best served by using threading, uses processes. I have 8 cores on my desktop. The program simply retrieves 54 URLs concurrently (or in parallel, in this case). The program is passed an argument, the size of the pool to use. Unfortunately, there is initial overhead just to create additional processes, so the savings begin to fall off if you create too many processes. But if the task were long-running and had a lot of I/O, then the overhead of creating the processes would be worth it in the end:

from concurrent.futures import ProcessPoolExecutor, as_completed
import functools
import time

import requests


def time_it(func):
    # Minimal stand-in for the custom `timing.time_it` decorator used in the
    # original code: prints the wall-clock time a call took.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print(f"func: {func.__name__} args: [{args}, {kwargs}]"
              f" took: {time.time() - start} sec.")
        return result
    return wrapper


def get_url(url):
    resp = requests.get(url, headers={'user-agent': 'my-app/0.0.1'})
    return resp.text


@time_it
def main(poolsize):
    # 54 URLs: the same three sites repeated 18 times
    urls = ['https://ibm.com', 'https://microsoft.com', 'https://google.com'] * 18
    with ProcessPoolExecutor(poolsize) as executor:
        futures = {executor.submit(get_url, url): url for url in urls}
        for future in as_completed(futures):
            text = future.result()
            url = futures[future]
            print(url, text[0:80])
            print('-' * 100)


if __name__ == '__main__':
    import sys
    main(int(sys.argv[1]))

8 processes (the number of cores I have):

func: main args: [(8,), {}] took: 2.316840410232544 sec.

16 processes:

func: main args: [(16,), {}] took: 1.7964842319488525 sec.

24 processes:

func: main args: [(24,), {}] took: 2.2560818195343018 sec.

The implementation of multiprocessing.Pool uses:

        if processes is None:
            processes = os.cpu_count() or 1
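For what it's worth, concurrent.futures.ProcessPoolExecutor (used in the example above) behaves the same way when max_workers is None. A quick sanity check on your own machine peeks at the executor's private _max_workers attribute (an implementation detail, shown for illustration only; very recent CPython versions may use os.process_cpu_count() instead, which respects affinity):

import os
from concurrent.futures import ProcessPoolExecutor

if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:  # max_workers=None
        # _max_workers is private; fine for a quick look, don't rely on it.
        print(executor._max_workers, os.cpu_count())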

Not sure if that answers your question, but at least it's a datapoint.
