
DotNet Core 2.1 hoarding memory in Linux

I have a websocket server that hoards memory over a period of days, to the point that Kubernetes eventually kills it. We monitor it using prometheus-net.
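For context, the metrics come from an ASP.NET Core middleware. A minimal sketch of the wiring, assuming the prometheus-net.AspNetCore package (the class layout is illustrative, not our actual code):

using Microsoft.AspNetCore.Builder;
using Prometheus;

public class Startup
{
    public void Configure(IApplicationBuilder app)
    {
        // Expose runtime metrics (GC collection counts, working set, ...)
        // at /metrics for the Prometheus scraper.
        app.UseMetricServer();

        app.UseWebSockets();
    }
}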

# dotnet --info

Host (useful for support):
  Version: 2.1.6
  Commit:  3f4f8eebd8

.NET Core SDKs installed:
  No SDKs were found.

.NET Core runtimes installed:
  Microsoft.AspNetCore.All 2.1.6 [/usr/share/dotnet/shared/Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.1.6 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 2.1.6 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

But when I connect remotely and take a memory dump (using createdump), the memory suddenly drops... without the service stopping, restarting or losing any connected user. See the green line in the picture.
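For reference, createdump ships alongside the runtime; the dump was taken with an invocation along these lines (the pid and output path are placeholders, and the exact flags may vary by runtime version, so check createdump --help):

# createdump lives next to the runtime binaries
/usr/share/dotnet/shared/Microsoft.NETCore.App/2.1.6/createdump --withheap --name /tmp/coredump <pid>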

I can see in the graphs that GC is collecting regularly in all generations.

Server GC is disabled using:

<PropertyGroup>
  <ServerGarbageCollection>false</ServerGarbageCollection>
</PropertyGroup>
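If you prefer to keep this out of the project file, the same knob can be set in runtimeconfig.json (an equivalent setting, shown as a sketch):

{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.Server": false
    }
  }
}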

Before disabling server GC, the service used to grow its memory much faster. Now it takes two weeks to reach 512 MB.

Other services using ASP.NET Core in a request/response fashion do not show this problem. This one uses WebSockets, where each connection usually lasts around 10 minutes... so I guess everything related to a connection easily survives into Gen 2.

[graph: pod memory usage] Note that there are two pods showing the same behaviour, and then one of them (the green) suddenly drops in memory usage due to the taking of the memory dump.


The pods did not restart while the memory dump was being taken.

No connection was lost or restarted.

Heap:

(lldb) eeheap -gc
Number of GC Heaps: 1
generation 0 starts at 0x00007F8481C8D0B0
generation 1 starts at 0x00007F8481C7E820
generation 2 starts at 0x00007F852A1D7000
ephemeral segment allocation context: none
         segment             begin         allocated              size
00007F852A1D6000  00007F852A1D7000  00007F853A1D5E90  0xfffee90(268430992)
00007F84807D0000  00007F84807D1000  00007F8482278000  0x1aa7000(27947008)
Large object heap starts at 0x00007F853A1D7000
         segment             begin         allocated              size
00007F853A1D6000  00007F853A1D7000  00007F853A7C60F8  0x5ef0f8(6222072)
Total Size:              Size: 0x12094f88 (302600072) bytes.
------------------------------
GC Heap Size:            Size: 0x12094f88 (302600072) bytes.
(lldb)

Free objects:

(lldb) dumpheap -type Free -stat
Statistics:
              MT    Count    TotalSize Class Name
00000000010c52b0   219774     10740482      Free
Total 219774 objects
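If you do stay inside lldb, the usual next steps with the SOS commands are to rank types by size and then chase the roots of a suspicious instance (the method table and object address below are placeholders):

(lldb) dumpheap -stat
(lldb) dumpheap -mt <method-table-address>
(lldb) gcroot <object-address>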

Is there any explanation for this behaviour?

The problem was the connection to RabbitMQ. Because we were using short-lived channels, the "auto-reconnect" feature of RabbitMQ.Client was keeping a lot of state about dead channels. We switched this configuration off, since we do not need the "perks" of the "auto-reconnect" feature, and everything started working normally. It was a pain, but we basically had to set up a Windows deployment and do the usual memory analysis process with Windows tools (JetBrains dotMemory in this case). Using lldb is not productive at all.
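For anyone hitting the same issue: the fix amounted to turning off automatic recovery on the ConnectionFactory. A minimal sketch, assuming RabbitMQ.Client's standard API (the host name and the channel work are placeholders):

using RabbitMQ.Client;

class RabbitSketch
{
    static void Main()
    {
        var factory = new ConnectionFactory
        {
            HostName = "rabbitmq",            // placeholder host
            // Disable the auto-reconnect machinery that was keeping
            // state for every dead channel:
            AutomaticRecoveryEnabled = false,
            TopologyRecoveryEnabled = false
        };

        using (var connection = factory.CreateConnection())
        using (var channel = connection.CreateModel())
        {
            // short-lived channel work goes here
        }
    }
}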

Disclaimer: I am no .NET Wizard.

But you should do two things to follow Kubernetes best practices (sketched in the manifest fragment after this list):

  1. Define sensible resource limits for your app. If the app does not need more than 200 MB of memory, define a resource limit to prevent it from consuming all available host memory. But be aware that the Unix API for querying available memory is not cgroup-aware and always reports the host's memory, no matter what your cgroup says.

  2. Tell your app what this resource limit is. It seems like your app does not "feel the need" to free memory because there is plenty available. Almost all applications, and frameworks as well, have a switch to define the maximum memory to be consumed. Tell your app this limit, and it will "see" memory pressure and perform a full GC (which I guess could be the problem here).
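Both points together look roughly like this in a Deployment manifest (container fragment only; the numbers, names, and the env-var plumbing are assumptions to illustrate the idea):

containers:
  - name: websocket-server          # placeholder name
    image: example/websocket:latest # placeholder image
    resources:
      requests:
        memory: "256Mi"
      limits:
        memory: "512Mi"             # point 1: cap the container
    env:
      # point 2: pass the limit down via the Downward API so the
      # app can size itself; how the app consumes the variable is
      # up to you.
      - name: APP_MEMORY_LIMIT
        valueFrom:
          resourceFieldRef:
            resource: limits.memory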
