RAM use on 2-CPU Xeon server causing poor performance in Python 3

System in question is a 2-CPU Xeon server running CentOS with 256 GB RAM:

2 x Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz

Each CPU has 8 cores, so with hyperthreading 32 processors show up in /proc/cpuinfo.

In using this system, I noticed some peculiar performance issues in some data processing. The data processing system is built in Python 3.3.5 (environment set up with Anaconda) and spawns a bunch of processes that read data from a file, create some numpy arrays, and do some processing.

I was testing the processing with various numbers of processes spawned. Up to a certain number of processes, performance stayed relatively constant. However, once I got up to 16 processes, I noticed that a numpy.abs() call started taking around 10 times longer than it otherwise should, from around 2 seconds to 20 or more seconds.

Now, total memory usage in this test was not a problem. Of the 256 GB system RAM, htop showed 100+ GB free, and meminfo wasn't showing any swapping.

I ran another test using 16 processes, but loading less data, with total memory use around 75 GB. In this case, the numpy.abs() call was taking 1 second (which is expected since it's half the data). Going to 24 processes, still using less than half of the system RAM, the numpy.abs() call likewise took around 1 second. So I was no longer seeing the 10x performance hit.

The interesting thing here is that it does seem like if more than half the system memory is used, performance degrades terribly. It doesn't seem like this should be the case, but I have no other explanation.

I wrote a Python script that sort of simulates what the processing framework does. I've tried various methods of spawning processes, multiprocessing.Pool apply_async(), concurrent.futures, and multiprocessing.Process, and they all give the same results (a concurrent.futures variant is sketched after the script below).

import pdb
import os
import sys
import time
import argparse
import numpy
import multiprocessing as mp

def worker(n):
    print("Running worker", n)

    # Each worker builds a 20000 x 10000 complex array: two ~1.6 GB float64
    # arrays, a ~3.2 GB complex128 array, and the ~1.6 GB float64 result of
    # numpy.abs(), so roughly 8 GB per process before temporaries.
    NX = 20000
    NY = 10000

    time_start = time.time()

    x1r = numpy.random.rand(NX,NY)
    x1i = numpy.random.rand(NX,NY)
    x1 = x1r + 1j * x1i
    x1a = numpy.abs(x1)

    # Per-worker wall-clock time for the allocations plus the abs() call.
    print(time.time() - time_start)

def proc_file(nproc):
    # Spawn nproc independent worker processes, then wait for them all.
    procs = {}

    for i in range(0,nproc):
        procs[i] = mp.Process(target = worker, args = (i, ))
        procs[i].start()

    for i in range(0,nproc):
        procs[i].join()

if __name__ == "__main__":
    time_start = time.time()

    DEFAULT_NUM_PROCS = 8

    ap = argparse.ArgumentParser()

    ap.add_argument('-nproc', default = DEFAULT_NUM_PROCS, type = int,
                    help = "Number of cores to run in parallel, default = %d" \
                            % DEFAULT_NUM_PROCS)

    opts = ap.parse_args()

    nproc = opts.nproc

    # spawn processes
    proc_file(nproc)

    time_end = time.time()

    print('Done in', time_end - time_start, 's')
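
For reference, a minimal sketch of the concurrent.futures variant of the same test (the worker body is repeated so the snippet is self-contained, and the 16-worker count is hardcoded just for brevity):

import time
import numpy
import concurrent.futures

NX = 20000
NY = 10000

def worker(n):
    # Same per-worker allocation as the script above: two float64 arrays,
    # one complex128 array, and the float64 result of numpy.abs().
    print("Running worker", n)
    t0 = time.time()
    x1 = numpy.random.rand(NX, NY) + 1j * numpy.random.rand(NX, NY)
    x1a = numpy.abs(x1)
    print(time.time() - t0)

def run(nproc):
    # Submit one task per worker and wait for all of them to finish.
    with concurrent.futures.ProcessPoolExecutor(max_workers=nproc) as ex:
        futures = [ex.submit(worker, i) for i in range(nproc)]
        for f in concurrent.futures.as_completed(futures):
            f.result()  # re-raise any exception from a worker

if __name__ == "__main__":
    t0 = time.time()
    run(16)
    print('Done in', time.time() - t0, 's')

As noted above, this version shows the same timing behavior as the multiprocessing.Process and Pool.apply_async() versions.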

Some results for various numbers of processes:

$ python test_multiproc_2.py -nproc 4
Running worker 0
Running worker 1
Running worker 2
Running worker 3
12.1790452003479
12.180120944976807
12.191224336624146
12.205029010772705
Done in 12.22369933128357 s

$ python test_multiproc_2.py -nproc 8
Running worker 0
Running worker 1
Running worker 2
Running worker 3
Running worker 4
Running worker 5
Running worker 6
Running worker 7
12.685678720474243
12.692482948303223
12.704699039459229
13.247581243515015
13.253047227859497
13.261905670166016
13.29712200164795
13.458561897277832
Done in 13.478906154632568 s

$ python test_multiproc_2.py -nproc 16
Running worker 0
Running worker 1
Running worker 2
Running worker 3
Running worker 4
Running worker 5
Running worker 6
Running worker 7
Running worker 8
Running worker 9
Running worker 10
Running worker 11
Running worker 12
Running worker 13
Running worker 14
Running worker 15
135.4193136692047
145.7047221660614
145.99714827537537
146.088121175766
146.3116044998169
146.94093680381775
147.05147790908813
147.4889578819275
147.8443088531494
147.92090320587158
148.32112169265747
148.35854578018188
149.11916518211365
149.22325253486633
149.45888781547546
149.74489760398865
Done in 149.97473335266113 s

So, 4 and 8 processes are about the same, but with 16 processes it is 10 times slower! The notable thing is that in the 16-process case, memory usage hits 146 GB.

If I cut the size of the numpy array in half and run it again:

$ python test_multiproc_2.py -nproc 4
Running worker 1
Running worker 0
Running worker 2
Running worker 3
5.926755666732788
5.93787956237793
5.949704885482788
5.955750226974487
Done in 5.970340967178345 s

$ python test_multiproc_2.py -nproc 16
Running worker 1
Running worker 3
Running worker 0
Running worker 2
Running worker 5
Running worker 4
Running worker 7
Running worker 8
Running worker 6
Running worker 11
Running worker 9
Running worker 10
Running worker 13
Running worker 12
Running worker 14
Running worker 15
7.728739023208618
7.751606225967407
7.754587173461914
7.760802984237671
7.780809164047241
7.802706241607666
7.852390766143799
7.8615334033966064
7.876686096191406
7.891174793243408
7.916942834854126
7.9261558055877686
7.947092771530151
7.967057704925537
8.012752294540405
8.119316577911377
Done in 8.135530233383179 s

So, there's a little bit of a performance hit between 4 and 16 processes, but nothing close to what is seen with the larger array.

Also, if I double the array size and run it again:

$ python test_multiproc_2.py -nproc 4
Running worker 1
Running worker 0
Running worker 2
Running worker 3
23.567795515060425
23.747386693954468
23.76904606819153
23.781703233718872
Done in 23.83848261833191 s

$ python test_multiproc_2.py -nproc 8
Running worker 1
Running worker 0
Running worker 3
Running worker 2
Running worker 5
Running worker 4
Running worker 6
Running worker 7
103.20905923843384
103.52968168258667
103.62282609939575
103.62272334098816
103.77079129219055
103.77456998825073
103.86126565933228
103.87058663368225
Done in 104.26257705688477 s

With 8 processes now, RAM use hits 145 GB, and there's a 5X performance hit.

I don't know what to make of this. The system becomes basically unusable if more than half of the system memory is being used. But I don't know if that's just coincidence and something else is to blame.

Is this a Python thing? Or a system architecture thing? Does each physical CPU only play well with half the system memory? Or is it a memory bandwidth issue? What else can I do to try to figure this out?
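
One experiment that might separate a NUMA/memory-bandwidth effect from a Python problem would be to pin all of the workers to the cores of a single socket, so every allocation stays local to one memory node. A minimal sketch using os.sched_setaffinity (Linux only); the assumption that logical CPUs 0-7 and their hyperthread siblings 16-23 belong to socket 0 is specific to this box and should be checked with lscpu first:

import os

# Assumed mapping for this machine: socket 0 owns logical CPUs 0-7 and their
# hyperthread siblings 16-23. Verify with `lscpu -e` before relying on it.
SOCKET0_CPUS = set(range(0, 8)) | set(range(16, 24))

def worker(n):
    # Restrict this worker (pid 0 means the calling process) to socket 0's
    # cores; with Linux's default first-touch policy its arrays should then
    # be allocated in that socket's local memory.
    os.sched_setaffinity(0, SOCKET0_CPUS)
    print("Worker", n, "pinned to CPUs", sorted(os.sched_getaffinity(0)))
    # ... same numpy allocations and numpy.abs() call as in the test script ...

If the slowdown disappears when everything runs on one socket (and its local half of the RAM), that would point at cross-socket memory traffic rather than Python itself.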

This is a problem with garbage-collected languages: if you get too close to your maximum RAM, they keep trying to run the GC, which drives up CPU usage.
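
A quick way to test that hypothesis would be to turn off CPython's cyclic garbage collector inside each worker and see whether the 10x slowdown persists. This is only a sketch of the check: gc.disable() stops the cycle detector, but the big numpy buffers are freed by ordinary reference counting, which keeps running regardless.

import gc

def worker(n):
    # Rule garbage collection in or out: disable the cyclic collector for
    # this worker process before doing the numpy work.
    gc.disable()
    # ... same numpy allocations and numpy.abs() call as in the test script ...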

The only thing that resolves the problem is clearing the cached memory. I ran a test that needed just about all 256 GB of memory when the OS was using about 200 GB for cache. It took forever and started crapping out once the OS started freeing cache. After this test ran, 'free -m' showed only 3 GB of cached memory. I ran the same benchmark again and it completed in the expected amount of time, with none of the CPU craziness seen before. Performance stayed constant over repeated runs.

So, contrary to what I've read online saying that the OS memory cache does not affect application performance, my experience very much tells me that it does, at least in this particular use case.
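
For anyone trying to reproduce this, a small helper that reports MemFree and Cached from /proc/meminfo before and after a run makes the page-cache effect easy to see. (Actually dropping the cache requires root, e.g. a sync followed by writing 3 to /proc/sys/vm/drop_caches.) A quick sketch:

def meminfo(fields=("MemFree", "Cached")):
    # Parse /proc/meminfo and return the requested fields, converted to GB.
    out = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            if key in fields:
                out[key] = round(int(value.split()[0]) / (1024 * 1024), 1)  # kB -> GB
    return out

print(meminfo())  # e.g. {'MemFree': 103.2, 'Cached': 3.1}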
