How do you fix a memory leak within Django tests?

Recently I started having some problems with Django (3.1) tests, which I finally tracked down to some kind of memory leak. I normally run my suite (roughly 4000 tests at the moment) with --parallel=4, which results in a high memory watermark of roughly 3GB (starting from 500MB or so). For auditing purposes, though, I occasionally run it with --parallel=1 - when I do this, the memory usage keeps increasing, ending up over the VM's allocated 6GB.

I spent some time looking at the data and it became clear that the culprit is, somehow, WebTest - more specifically, its response.html and response.forms: each call during the test case might allocate a few MBs (two or three, generally) which don't get released at the end of the test method and, more importantly, not even at the end of the TestCase.
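
To give an idea of the pattern involved, the tests exercise those responses roughly like the sketch below (assuming a standard django-webtest setup; the URL, form id and field name are made up for illustration):

from django_webtest import WebTest

class SignupTest(WebTest):
    def test_signup_form(self):
        response = self.app.get('/signup/')            # a webtest TestResponse
        form = response.forms['signup-form']           # parses the HTML and builds form objects
        form['email'] = 'someone@example.com'
        result = form.submit()
        self.assertIsNotNone(result.html.find('h1'))   # response.html is a BeautifulSoup tree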

I've tried everything I could think of - gc.collect() with gc.DEBUG_LEAK shows me a whole lot of collectable items, but it frees no memory at all; using delattr() on various TestCase and TestResponse attributes and so on resulted in no change at all, etc.
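
For reference, the gc probing looked roughly like this (a sketch, not the exact code; the class name is made up):

import gc
from django.test import TestCase

class LeakProbeTestCase(TestCase):
    def tearDown(self):
        gc.set_debug(gc.DEBUG_LEAK)   # DEBUG_LEAK implies DEBUG_SAVEALL, so "freed" objects land in gc.garbage
        found = gc.collect()          # force a full collection after every test
        print(f"{self.id()}: collector found {found} objects, gc.garbage holds {len(gc.garbage)}")
        gc.set_debug(0)
        super().tearDown()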

I'm quite literally at my wits' end, so any pointer to solve this (besides editing the thousand or so tests which use WebTest responses, which is really not feasible) would be very much appreciated.

(Please note that I also tried using guppy, tracemalloc and memory_profiler, but none of them gave me any kind of actionable information.)


Update

I found that one of our EC2 testing instances isn't affected by the problem, so I spent some more time trying to figure this out. Initially, I tried to find the "sensible" potential causes - for instance, the cached template loader, which was enabled on my local VM and disabled on the EC2 instance - without success. Then I went all in: I replicated the EC2 virtualenv (with pip freeze) and the settings (copying the dotenv), and checked out the same commit where the tests were running normally on the EC2 instance.

Et voilà! THE MEMORY LEAK IS STILL THERE!

Now, I'm officially giving up and will use --parallel=2 for future tests until some absolute guru can point me in the right direction.


Second update

And now the memory leak is there even with --parallel=2. I guess that's somehow better, since it looks increasingly like it's a system problem rather than an application problem. It doesn't solve anything, but at least I know it's not my fault.


Third update

Thanks to Tim Boddy's reply to this question, I tried using chap to figure out what's making memory grow. Unfortunately I can't "read" the results properly, but it looks like some non-Python library is actually causing the problem. So, this is what I've seen analyzing the core after a few minutes of running the tests that I know cause the leak:

chap> summarize writable
49 ranges take 0x1e0aa000 bytes for use: unknown
1188 ranges take 0x12900000 bytes for use: python arena
1 ranges take 0x4d1c000 bytes for use: libc malloc main arena pages
7 ranges take 0x3021000 bytes for use: stack
139 ranges take 0x476000 bytes for use: used by module
1384 writable ranges use 0x38b5d000 (951,439,360) bytes.
chap> count used
3144197 allocations use 0x14191ac8 (337,189,576) bytes.

The interesting point is that the non-leaking EC2 instance shows pretty much the same values as the ones I get from count used - which would suggest that those "unknown" ranges (0x1e0aa000, roughly 500 MB on their own) are the actual memory hogs. This is also supported by the output of summarize used (showing the first few lines):

Unrecognized allocations have 886033 instances taking 0x8b9ea38(146,401,848) bytes.
   Unrecognized allocations of size 0x130 have 148679 instances taking 0x2b1ac50(45,198,416) bytes.
   Unrecognized allocations of size 0x40 have 312166 instances taking 0x130d980(19,978,624) bytes.
   Unrecognized allocations of size 0xb0 have 73886 instances taking 0xc66ca0(13,003,936) bytes.
   Unrecognized allocations of size 0x8a8 have 3584 instances taking 0x793000(7,942,144) bytes.
   Unrecognized allocations of size 0x30 have 149149 instances taking 0x6d3d70(7,159,152) bytes.
   Unrecognized allocations of size 0x248 have 10137 instances taking 0x5a5508(5,920,008) bytes.
   Unrecognized allocations of size 0x500018 have 1 instances taking 0x500018(5,242,904) bytes.
   Unrecognized allocations of size 0x50 have 44213 instances taking 0x35f890(3,537,040) bytes.
   Unrecognized allocations of size 0x458 have 2969 instances taking 0x326098(3,301,528) bytes.
   Unrecognized allocations of size 0x205968 have 1 instances taking 0x205968(2,120,040) bytes.

The size of those single-instance allocations is very similar to the kind of deltas I see if I add calls to resource.getrusage(resource.RUSAGE_SELF).ru_maxrss in my test runner when starting/stopping tests - but they're not recognized as Python allocations, hence my feeling that the growth comes from outside Python itself.
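
For completeness, the bookkeeping mentioned above was essentially along these lines (a sketch; the base class name is hypothetical):

import resource
from django.test import TestCase

def max_rss_mb():
    # peak resident set size; ru_maxrss is reported in KB on Linux (bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

class RssDeltaTestCase(TestCase):
    def setUp(self):
        super().setUp()
        self._rss_before = max_rss_mb()

    def tearDown(self):
        delta = max_rss_mb() - self._rss_before
        if delta > 1:  # ru_maxrss is a high-water mark, so only growth shows up
            print(f"{self.id()}: max RSS grew by {delta:.1f} MB")
        super().tearDown()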

First of all, a huge apology: I was mistaken in thinking WebTest was the cause of this, and the reason was indeed in my own code, rather than in libraries or anything else.

The real cause was a mixin class where I, unthinkingly, added a dict as a class attribute, like

class MyMixin:
    # a mutable class attribute: one dict shared by every instance and subclass
    errors = dict()

Since this mixin is used in a few forms, and the tests generate a fair amount of form errors (which get added to the dict), this ended up hogging memory.
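
The fix itself is the standard one for mutable class attributes - give each instance its own dict, something along these lines (shown on the simplified class above, not the real form mixin):

class MyMixin:
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # a per-instance dict: errors collected by one form instance die with it
        self.errors = {}

The underlying gotcha is the usual Python one: a mutable object assigned at class level is created once and shared by every instance (and subclass), so anything added to it lives for the whole life of the process.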

While this is not very interesting in itself, there are a few takeaways that may be helpful to future explorers who stumble across the same kind of problem. They might all be obvious to everybody except me and a single other developer - in which case, hello, other developer.

  1. The reason why the same commit had different behaviors on the EC2 machine and my own VM is that the branch on the remote machine hadn't been merged yet, so the commit that introduced the leak wasn't there poisoning the environment. The takeaway here is: make sure the code you're testing is the same, not just the commit.
  2. Low-level memory analysis might help in some cases, but it's not a skill you pick up in half a day: I spent a long time trying to make sense of allocations and objects and whatever, without getting any closer to the solution.
  3. This kind of mistake can be incredibly costly - if I had a few hundred fewer tests, I wouldn't have ended up with an OOM error, and I probably wouldn't have noticed the problem at all. Until it was in production, that is. That could be fixed with some kind of linter/static analysis too, if there were one which flags this kind of construction as potentially harmful. Unfortunately, there isn't one (that I could find).
  4. git bisect is your friend, as long as you can find a commit that actually works.
