
Numpy on Mac M1 abnormally slow

I have conducted a simple speed test for my numpy:

import numpy as np

A = np.random.rand(1000, 1000)
B = np.random.rand(1000, 1000)

%timeit A.dot(B)

The result is:

30.3 ms ± 829 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This result seems abnormally slow compared with what others typically see (less than 10 ms on average). I'm wondering what could possibly be the cause of such behavior.

My system is macOS Big Sur on an M1 chip. The Python version is 3.8.13 and the numpy version is 1.22.4. numpy was installed via

pip install "numpy==1.22.4"

The output of np.show_config() is:

openblas64__info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
blas_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
openblas64__lapack_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
lapack_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42
    not found = AVX,F16C,FMA3,AVX2,AVX512F,AVX512CD,AVX512_KNL,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
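One detail worth noting in this output: every SIMD extension listed (SSE, SSSE3, AVX, ...) is an x86 instruction set, which suggests this numpy wheel, and likely the Python interpreter itself, is an x86_64 build running under Rosetta 2 emulation rather than a native arm64 build. A quick way to check (a sketch; the expected strings are for CPython's standard `platform` module):

```python
import platform

# 'arm64' indicates a native Apple-silicon interpreter;
# 'x86_64' on an M1 Mac indicates the interpreter runs under Rosetta 2.
print(platform.machine())
```

If this prints `x86_64` on an M1, reinstalling a native arm64 Python (e.g. via Miniforge) and a matching numpy wheel would be the first thing to try.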

Edit:

I did another test with this code snippet (from 1):

import time
import numpy as np
np.random.seed(42)
a = np.random.uniform(size=(300, 300))
runtimes = 10

timecosts = []
for _ in range(runtimes):
    s_time = time.time()
    for i in range(100):
        a += 1
        np.linalg.svd(a)
    timecosts.append(time.time() - s_time)

print(f'mean of {runtimes} runs: {np.mean(timecosts):.5f}s')

The result of my test is:

mean of 10 runs: 6.17438s

whereas the reference results on the website 1 are (the chip is an M1 Max):

+-----------------------------------+-----------------------+--------------------+
|   Python installed by (run on)→   | Miniforge (native M1) | Anaconda (Rosetta) |
+----------------------+------------+------------+----------+----------+---------+
| Numpy installed by ↓ | Run from → |  Terminal  |  PyCharm | Terminal | PyCharm |
+----------------------+------------+------------+----------+----------+---------+
|          Apple Tensorflow         |   4.19151  |  4.86248 |     /    |    /    |
+-----------------------------------+------------+----------+----------+---------+
|        conda install numpy        |   4.29386  |  4.98370 |  4.10029 | 4.99271 |
+-----------------------------------+------------+----------+----------+---------+

From these results, my code's timing is slower than any of the numpy configurations in the reference.

I've noticed similar slowdowns on M1, but I think the actual cause, at least on my computer, is not a fundamentally faulty numpy installation, but some problem with the benchmarks themselves. Consider the following example:

In [25]: from scipy import linalg

In [26]: a = np.random.randn(1000,100)

In [27]: %timeit a.T @ a
226 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [28]: x = a.T @ a

In [29]: %timeit linalg.eigh(x)
1.69 ms ± 88.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [30]: %timeit linalg.eigh(a.T @ a)
428 ms ± 99.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Computing x = a.T @ a followed by eigh(x) takes about 2 ms, while eigh(a.T @ a) takes about 400 ms. I think in the latter case it's some problem with %timeit. Maybe for some reason the computation gets routed to the "efficiency cores"?
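One way to cross-check the %timeit figures outside IPython is the standard `timeit` module, timing both variants under identical conditions. This is a sketch: it uses `np.linalg.eigh` instead of `scipy.linalg.eigh` to avoid the scipy dependency, and the repeat/number counts are arbitrary choices.

```python
import timeit

import numpy as np

a = np.random.randn(1000, 100)

# Best-of-5 timing, 50 calls per measurement, for each variant.
t_split = min(timeit.repeat(
    "x = a.T @ a; np.linalg.eigh(x)", globals=globals(), repeat=5, number=50))
t_fused = min(timeit.repeat(
    "np.linalg.eigh(a.T @ a)", globals=globals(), repeat=5, number=50))

print(f"split: {t_split / 50 * 1e3:.2f} ms/call")
print(f"fused: {t_fused / 50 * 1e3:.2f} ms/call")
```

If the large gap between the two variants disappears here, that would support the theory that the anomaly lies in how %timeit schedules the measurement rather than in the computation itself.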

My tentative answer is that your first benchmark with %timeit is not reliable.

If you suspect an issue in timeit, try using time instead:

import time
start = time.time()

# your numpy test here

took = time.time() - start
print(f"Test took {took} seconds.")
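Applied to the matmul test from the question, that might look like the following sketch. It uses `time.perf_counter`, which is better suited to short intervals than `time.time`, and adds one warm-up call so one-time setup cost is not counted; the loop count of 10 is an arbitrary choice.

```python
import time

import numpy as np

A = np.random.rand(1000, 1000)
B = np.random.rand(1000, 1000)

A.dot(B)  # warm-up call, excluded from the measurement

n_loops = 10
start = time.perf_counter()
for _ in range(n_loops):
    A.dot(B)
elapsed = time.perf_counter() - start

print(f"mean over {n_loops} loops: {elapsed / n_loops * 1e3:.1f} ms")
```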

For more information on numpy on Apple silicon, please read the first answer in the link below. For optimal performance, it is advised to use Apple's accelerated vecLib. If you install using conda, then also check out @AndrejHribernik's comment: Why Python native on M1 Max is greatly slower than Python on old Intel i5?
