
A single Python script involving np.linalg.eig is inexplicably taking 128 CPUs?

Note: The problem seems to be related to np.linalg.eig, eigsh, and scipy.sparse.linalg.eigsh. For scripts not involving these functions, everything on the AWS box works as expected.

The most basic script I have found with the problem is:

import numpy as np

num_iter = 10  # any iteration count reproduces the issue
for i in range(num_iter):
    x = np.linalg.eig(np.random.rand(1000, 1000))
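(A quick diagnostic, for context: np.linalg.eig dispatches into the compiled BLAS/LAPACK backend, and backends such as OpenBLAS and MKL run their own thread pools outside the GIL. You can check which backend a given NumPy build links against:)

```python
# Print which BLAS/LAPACK backend this NumPy build links against;
# these backends run their own thread pools inside eig, outside the
# GIL, which is how a single Python process can occupy every core.
import io
import contextlib

import numpy as np

buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    np.show_config()  # prints build/backend information
info = buf.getvalue()
print(info)
```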

I'm having a very bizarre error on AWS where a basic Python script that calculates eigenvalues is using 100% of 64 cores (and is going no faster because of it).

Objective: Run computationally intensive Python code. The code is a parallel for loop where each iteration is independent. I have two versions of this code: a basic version without multiprocessing, and one using the multiprocessing module.

Problem: The virtual machine is a c6i.32xlarge box on AWS with 64 cores/128 threads.

  • On my personal machine, the parallelized code on 6 cores is roughly 6 times faster. On the AWS box, using more than 1 core with the same code makes the runtime slower.

Inexplicable Part:

  • I tried to get around this by launching multiple copies of the basic script with &, and this doesn't work either: running n copies makes each of them slower by a factor of n. Inexplicably, a single instance of the Python script uses all the cores of the machine. The Unix top command reports 6400% CPU usage (i.e. all of them), and AWS CPU usage monitoring confirms 100% usage of the machine. I don't see how this is possible given the GIL.

Partial solution? Pinning each process to a CPU fixed the issue somewhat:

  • Running taskset --cpu-list $i python my_python_script.py & for i from 1 to n, the copies do indeed run in parallel, and the runtime is independent of n (for small n). CPU usage on the AWS monitor is exactly what you would expect. The speed on one pinned processor was the same as when the script was taking all the cores of the machine.

Note: The fact that the runtime on 1 pinned processor is the same suggests the script was effectively running on 1 core all along, and the other cores were somehow being used to no benefit.

Question:

Why is my basic Python script taking all 64 cores of the AWS machine while going no faster? How is this even possible? And how can I get it to run simply with multiprocessing, without this weird taskset --cpu-list workaround?

I had the exact same problem on the Google Cloud Platform as well.

The basic script is very simple:

from my_module import my_np_and_scipy_function
from my_other_module import input_function

if __name__ == "__main__":
    output = []
    for i in range(num_iter):
        result = my_np_and_scipy_function(kwds, param=input_function)
        output.extend(result)

With multiprocessing, it is:

import multiprocessing

from my_module import my_np_and_scipy_function

if __name__ == "__main__":
    results = []
    pool = multiprocessing.Pool(cpu_count)
    for i in range(num_iter):
        result = pool.apply_async(
            my_np_and_scipy_function,
            kwds={"param": input_function, ...},
        )
        results.append(result)

    output = []
    for x in results:
        output.extend(x.get())

NumPy uses multiprocessing in some functions, so it is possible. You can see the occurrences here: https://github.com/numpy/numpy/search?q=multiprocessing

Following the answers in the post Limit number of threads in numpy, the numpy eig functions and the scripts work properly when the following lines of code are put at the top of the script:

import os

os.environ["MKL_NUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"
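Two caveats worth noting (based on how these libraries read their configuration): the variables must be set before numpy is first imported, since the thread pools are sized once at load time; and if the box ships NumPy with OpenBLAS rather than MKL, OPENBLAS_NUM_THREADS is the variable that matters. A sketch covering both:

```python
import os

# Must run before the first `import numpy`; the backend libraries read
# these variables once, when they are loaded.
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # in case the backend is OpenBLAS

import numpy as np

# Sanity check: eig still works, now on a single thread.
w = np.linalg.eigvals(np.eye(4))
print(w)  # eigenvalues of the identity are all 1
```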
