SLURM Python Script accumulating memory in loop

I am running a simple Python script on the SLURM scheduler for HPC. It reads in a data set (approximately 6GB) and plots and saves images of parts of the data. There are several of these data files, so I use a loop to iterate until I have finished plotting the data from each file.

For some reason, however, memory usage increases on each loop iteration. I've tracked my variables using getsizeof(), but they don't seem to change over iterations, so I'm not sure where this memory "leak" could be coming from.

Here's my script:

import os, psutil
import sdf_helper as sh
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as plticker
plt.rcParams['figure.figsize'] = [6, 4]
plt.rcParams['figure.dpi'] = 120 # 200 e.g. is really fine, but slower
from sys import getsizeof


for i in range(5,372):
    plt.clf()   
    fig, ax = plt.subplots()
    #dd gets data using the epoch specific SDF file reader sh.getdata
    dd = sh.getdata(i,'/dfs6/pub/user');
    #extract density data as 2D array
    den = dd.Derived_Number_Density_electron.data.T;
    nmin = np.min(dd.Derived_Number_Density_electron.data[np.nonzero(dd.Derived_Number_Density_electron.data)])
    #extract grid points as 2D array
    xy = dd.Derived_Number_Density_electron.grid.data
    #extract single number time
    time = dd.Header.get('time')
    #free up memory from dd
    dd = None
    #plotting
    plt.pcolormesh(xy[0], xy[1],np.log10(den), vmin = 20, vmax = 30)
    cbar = plt.colorbar()
    cbar.set_label('Density in log10($m^{-3}$)')
    plt.title("time:   %1.3e s \n Min e- density:   %1.2e $m^{-3}$" %(time,nmin))
    ax.set_facecolor('black')
    plt.savefig('D00%i.png'%i, bbox_inches='tight')
    print("dd: ", getsizeof(dd))
    print("den: ",getsizeof(den))
    print("nmin: ",getsizeof(nmin))
    print("xy: ",getsizeof(xy))
    print("time: ",getsizeof(time))
    print("fig: ",getsizeof(fig))
    print("ax: ",getsizeof(ax))
    process = psutil.Process(os.getpid())
    print(process.memory_info().rss)

Output

Reading file /dfs6/pub/user/0005.sdf
dd:  16
den:  112
nmin:  32
xy:  56
time:  24
fig:  48
ax:  48
8991707136

Reading file /dfs6/pub/user/0006.sdf
dd:  16
den:  112
nmin:  32
xy:  56
time:  24
fig:  48
ax:  48
13814497280

Reading file /dfs6/pub/user/0007.sdf
dd:  16
den:  112
nmin:  32
xy:  56
time:  24
fig:  48
ax:  48
18648313856

SLURM input

#!/bin/bash

#SBATCH -p free
#SBATCH --job-name=epochpyd1
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=20000


#SBATCH --mail-type=begin,end
#SBATCH --mail-user=**

module purge
module load python/3.8.0

python3 -u /data/homezvol0/user/CNTDensity.py > density.out

SLURM output

/data/homezvol0/user/CNTDensity.py:21: RuntimeWarning: divide by zero encountered in log10
  plt.pcolormesh(xy[0], xy[1],np.log10(den), vmin = 20, vmax = 30)
/export/spool/slurm/slurmd.spool/job1910549/slurm_script: line 16:  8004 Killed                  python3 -u /data/homezvol0/user/CNTDensity.py > density.out
slurmstepd: error: Detected 1 oom-kill event(s) in step 1910549.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

As far as I can tell, everything seems to be working. I'm not sure what could be taking up more than 20GB of memory.

EDIT: I began commenting out sections of the loop from the bottom up. It's now clear that pcolormesh is the culprit.

I've added the following to the end of the loop (per Closing pyplot windows):

fig.clear()
plt.clf()
plt.close('all')
fig = None
ax = None
del fig
del ax

But the memory keeps climbing no matter what. I'm at a total loss as to what's happening.

You're on the right track, having made it visible how much memory accumulates on each iteration. The next step in debugging is to think of hypotheses for where that memory could be accumulating and of ways to test those hypotheses.

One way to hold onto memory after each iteration is through variables like den. You can rule out those hypotheses (and thus narrow in on the problem) by clearing those variables as the code does via dd = None, deleting them via del dd, or moving portions of the loop body into subroutines so that those variables go away when the subroutines return, as sketched below. (Factoring out subroutines can also make those parts more reusable and easier to test.) This technique will rule out some possible causes of the problem, but I don't expect these variable assignments to accumulate memory over iterations; that would happen if the code added data to a dict or a list on each iteration.
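
Here's a minimal sketch of that refactoring; the helper name plot_frame is hypothetical, but its body is just your loop body moved into a function so that dd, den, xy, fig, and the other locals are released when it returns:

import numpy as np
import matplotlib.pyplot as plt
import sdf_helper as sh

def plot_frame(i, data_dir='/dfs6/pub/user'):
    # All locals (dd, den, xy, fig, ax, ...) become unreachable when this returns.
    dd = sh.getdata(i, data_dir)
    den = dd.Derived_Number_Density_electron.data.T
    xy = dd.Derived_Number_Density_electron.grid.data
    time = dd.Header.get('time')
    fig, ax = plt.subplots()
    ax.pcolormesh(xy[0], xy[1], np.log10(den), vmin=20, vmax=30)
    ax.set_title('time:   %1.3e s' % time)
    fig.savefig('D00%i.png' % i, bbox_inches='tight')
    plt.close(fig)  # still needed: pyplot itself keeps a reference to every open figure

for i in range(5, 372):
    plot_frame(i)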

Another hypothesis is that state is accumulating in matplotlib that doesn't get cleared by plt.clf(), or that state is accumulating in sdf_helper. I don't know enough about these libraries to provide direct insight, but their documentation should say how to clear out state. Even without knowing how to clear their state, we can think of ways to test these hypotheses, e.g. comment out the plt calls, or at least the data-intensive calls, then see if the memory still accumulates.
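
For example, here's a sketch of a stripped-down version of your loop with all plotting removed; if RSS still grows in this version, the accumulation is happening in sdf_helper (or the data it returns) rather than in matplotlib:

import os, psutil
import numpy as np
import sdf_helper as sh

process = psutil.Process(os.getpid())
for i in range(5, 372):
    dd = sh.getdata(i, '/dfs6/pub/user')
    den = dd.Derived_Number_Density_electron.data.T
    dd = None
    # No plotting at all -- just load, extract, and measure.
    print(i, process.memory_info().rss)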

You might think of more hypotheses than I did. Brainstorming hypotheses first is a good approach, since one of them might be an obviously best candidate, or one of them might be a lot easier to test than the others.

Beware that there could be multiple causes of accumulating memory, in which case fixing one cause will reduce the memory accumulation but won't eliminate it. Since you're measuring the memory accumulation, you'll be able to detect this. In many debugging situations we can't see the incremental contributions of multiple causes to a problem (such as flaky results), so an alternate technique is to cut out everything that might be causing the problem, then add the pieces back one at a time.

Additions

Now that you've narrowed the problem down to pcolormesh, the next step is reading the docs or tutorials on how matplotlib and pcolormesh use memory. Also, a web search for pcolormesh memory leak finds specific tips on this.

The easiest thing to try is adding a call to ax.cla() to clear the axes, as in this example.
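
As a minimal sketch of that approach, create the figure once outside the loop and clear its axes on each iteration (the colorbar is omitted here for brevity, since adding a new one every iteration would itself accumulate axes):

import numpy as np
import matplotlib.pyplot as plt
import sdf_helper as sh

fig, ax = plt.subplots()
for i in range(5, 372):
    ax.cla()  # clear the previous iteration's artists from the axes
    dd = sh.getdata(i, '/dfs6/pub/user')
    den = dd.Derived_Number_Density_electron.data.T
    xy = dd.Derived_Number_Density_electron.grid.data
    dd = None
    ax.pcolormesh(xy[0], xy[1], np.log10(den), vmin=20, vmax=30)
    fig.savefig('D00%i.png' % i, bbox_inches='tight')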

You could switch from pyplot to matplotlib's object-oriented interface, which doesn't retain as much (if any) global state. In contrast, I think pyplot retains the fig and ax, in which case releasing your variables isn't enough to release their objects.
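
Here's a sketch of that, assuming the Agg backend is acceptable since you're only saving PNGs; because pyplot never registers these figures, nothing keeps them alive once the loop variables are reassigned:

from matplotlib.figure import Figure
from matplotlib.backends.backend_agg import FigureCanvasAgg
import numpy as np
import sdf_helper as sh

for i in range(5, 372):
    dd = sh.getdata(i, '/dfs6/pub/user')
    den = dd.Derived_Number_Density_electron.data.T
    xy = dd.Derived_Number_Density_electron.grid.data
    dd = None
    fig = Figure(figsize=(6, 4), dpi=120)
    FigureCanvasAgg(fig)  # attach an Agg canvas so savefig can render
    ax = fig.add_subplot(111)
    mesh = ax.pcolormesh(xy[0], xy[1], np.log10(den), vmin=20, vmax=30)
    fig.colorbar(mesh, label='Density in log10($m^{-3}$)')
    ax.set_facecolor('black')
    fig.savefig('D00%i.png' % i, bbox_inches='tight')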

Apparently imshow uses less memory and time than pcolormesh, assuming your data fits on a rectangular grid.
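
A sketch of that substitution, replacing the plt.pcolormesh(...) and colorbar lines inside your existing loop; it assumes the grid in xy is uniform, so its extent can be described by its min and max coordinates:

x, y = xy[0], xy[1]
im = ax.imshow(np.log10(den), origin='lower',
               extent=[np.min(x), np.max(x), np.min(y), np.max(y)],
               vmin=20, vmax=30, aspect='auto')
fig.colorbar(im, label='Density in log10($m^{-3}$)')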

Note issues like #1741, which recommends creating the pcolormesh just once and then setting its data on each loop iteration -- can you do mesh = plt.pcolormesh(...) once, then something like mesh.set_array(np.log10(den)) on each iteration? That issue also recommends calling cla().
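
Here's a sketch of that pattern, assuming every file has the same grid and array shape (depending on your matplotlib version, set_array may expect the flattened array, hence the .ravel()):

import numpy as np
import matplotlib.pyplot as plt
import sdf_helper as sh

mesh = None
fig, ax = plt.subplots()
for i in range(5, 372):
    dd = sh.getdata(i, '/dfs6/pub/user')
    den = dd.Derived_Number_Density_electron.data.T
    xy = dd.Derived_Number_Density_electron.grid.data
    dd = None
    if mesh is None:
        # Build the QuadMesh (and its colorbar) only on the first iteration.
        mesh = ax.pcolormesh(xy[0], xy[1], np.log10(den), vmin=20, vmax=30)
        fig.colorbar(mesh, label='Density in log10($m^{-3}$)')
    else:
        mesh.set_array(np.log10(den).ravel())  # reuse the existing mesh with new values
    fig.savefig('D00%i.png' % i, bbox_inches='tight')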
