简体   繁体   中英

DotNet Core 2.1 hoarding memory in Linux

I have a websocket server that hoards memory during days, till the point that Kubernetes eventually kills it. We monitor it using prometheous-net .

# dotnet --info

Host (useful for support):
  Version: 2.1.6
  Commit:  3f4f8eebd8

.NET Core SDKs installed:
  No SDKs were found.

.NET Core runtimes installed:
  Microsoft.AspNetCore.All 2.1.6 [/usr/share/dotnet/shared/Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.1.6 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 2.1.6 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

But when I connect remotely and take a memory dump (using createdump ), suddently the memory drops... without the service stopping, restarting or loosing any connected user. See the green line in the picture.

I can see in the graphs, that GC is collecting regularly in all generations.

GC Server is disabled using:

<PropertyGroup>
  <ServerGarbageCollection>false</ServerGarbageCollection>
</PropertyGroup>

Before disabling GC Server, the service used to grow memory way faster. Now it takes two weeks to get into 512Mb.

Other services using ASP.NET Core on request/response fashion do not show this problem. This uses Websockets, where each connection last usually around 10 minutes... so I guess everything related with the connection survives till Gen 2 easily.

在此输入图像描述 Note that there are two pods, showing the same behaviour, and then one (the green) drops suddenly in memory ussage due the taking of the memory dump.

在此输入图像描述

The pods did not restart during the taking of the memory dump: 在此输入图像描述

No connection was lost or restarted.

Heap:

(lldb) eeheap -gc
Number of GC Heaps: 1
generation 0 starts at 0x00007F8481C8D0B0
generation 1 starts at 0x00007F8481C7E820
generation 2 starts at 0x00007F852A1D7000
ephemeral segment allocation context: none
         segment             begin         allocated              size
00007F852A1D6000  00007F852A1D7000  00007F853A1D5E90  0xfffee90(268430992)
00007F84807D0000  00007F84807D1000  00007F8482278000  0x1aa7000(27947008)
Large object heap starts at 0x00007F853A1D7000
         segment             begin         allocated              size
00007F853A1D6000  00007F853A1D7000  00007F853A7C60F8  0x5ef0f8(6222072)
Total Size:              Size: 0x12094f88 (302600072) bytes.
------------------------------
GC Heap Size:            Size: 0x12094f88 (302600072) bytes.
(lldb)

Free objects:

(lldb) dumpheap -type Free -stat
Statistics:
              MT    Count    TotalSize Class Name
00000000010c52b0   219774     10740482      Free
Total 219774 objects

Is there any explanation to this behaviour?

The problem was the connection to RabbitMQ. Because we were using sort lived channels, the "auto-reconnect" feature of the RabbitMQ.Client was keeping a lot of state about dead channels. We switched off this configuration, since we do not need the "perks" of the "auto-reconnect" feature, and everything start working normally. It was a pain, but we basically had to setup a Windows deploy and do the usual memory analysis process with Windows tools (Jetbrains dotMemory in this case). Using lldb is not productive at all.

Disclaimer: I am no .NET Wizard.

But you should do two things to go with Kubernetes best practices:

  1. Define sensible resource limits for your app. If the app does not need more than 200MB memory define a resource limit to prevent the app from consuming all available host memory. But be aware that the Unix API to get available memory is not capable of processing the cgroup the process has and always outputs the host memory no matter what your cgroup says.

  2. Tell your app what this resource limit is. It seems like your app does not "feel the need" to free memory as there is plenty. Almost all applications, and frameworks as well, have a switch to define the maximum memory to be consumed. Tell your app this limit, and it will "see" memory pressure and perform a full GC (what I guess could be the problem here)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM