
Understanding OOM odd behaviour?

My server triggered the OOM killer and I am trying to understand why. The system has plenty of RAM (128 GB), and it looks like only around 70 GB of it was actually in use. Reading through previous questions about OOM, it looks like this might be a case of memory fragmentation. See the syslog output:

Jun 23 17:20:10 server1 kernel: [517262.504589] gmond invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Jun 23 17:20:10 server1 kernel: [517262.504593] gmond cpuset=/ mems_allowed=0-1
Jun 23 17:20:10 server1 kernel: [517262.504598] CPU: 4 PID: 1522 Comm: gmond Tainted: P           OE 3.15.1-031501-lowlatency #201406161841
Jun 23 17:20:10 server1 kernel: [517262.504599] Hardware name: Dell Inc. PowerEdge R420/0K29HN, BIOS 2.3.3 07/10/2014
Jun 23 17:20:10 server1 kernel: [517262.504601]  0000000000000000 ffff880fce2ab848 ffffffff817746ec 0000000000000007
Jun 23 17:20:10 server1 kernel: [517262.504603]  ffff880f74691950 ffff880fce2ab898 ffffffff8176a980 ffff880f00000000
Jun 23 17:20:10 server1 kernel: [517262.504605]  000201da81383df8 ffff881470376540 ffff881dcf7ab2a0 0000000000000000
Jun 23 17:20:10 server1 kernel: [517262.504607] Call Trace:
Jun 23 17:20:10 server1 kernel: [517262.504615]  [<ffffffff817746ec>] dump_stack+0x4e/0x71
Jun 23 17:20:10 server1 kernel: [517262.504618]  [<ffffffff8176a980>] dump_header+0x7e/0xbd
Jun 23 17:20:10 server1 kernel: [517262.504620]  [<ffffffff8176aa16>] oom_kill_process.part.6+0x57/0x30a
Jun 23 17:20:10 server1 kernel: [517262.504623]  [<ffffffff811654e7>] oom_kill_process+0x47/0x50
Jun 23 17:20:10 server1 kernel: [517262.504625]  [<ffffffff81165825>] out_of_memory+0x145/0x1d0
Jun 23 17:20:10 server1 kernel: [517262.504628]  [<ffffffff8116c1ba>] __alloc_pages_nodemask+0xb1a/0xc40
Jun 23 17:20:10 server1 kernel: [517262.504634]  [<ffffffff811adba3>] alloc_pages_current+0xb3/0x180
Jun 23 17:20:10 server1 kernel: [517262.504636]  [<ffffffff81161737>] __page_cache_alloc+0xb7/0xd0
Jun 23 17:20:10 server1 kernel: [517262.504638]  [<ffffffff81163f80>] filemap_fault+0x280/0x430
Jun 23 17:20:10 server1 kernel: [517262.504642]  [<ffffffff8118a0d9>] __do_fault+0x39/0x90
Jun 23 17:20:10 server1 kernel: [517262.504644]  [<ffffffff8118e31e>] do_read_fault.isra.59+0x10e/0x1d0
Jun 23 17:20:10 server1 kernel: [517262.504646]  [<ffffffff8118e870>] do_linear_fault.isra.61+0x70/0x80
Jun 23 17:20:10 server1 kernel: [517262.504647]  [<ffffffff8118e986>] handle_pte_fault+0x76/0x1b0
Jun 23 17:20:10 server1 kernel: [517262.504652]  [<ffffffff81095fe0>] ? lock_hrtimer_base.isra.25+0x30/0x60
Jun 23 17:20:10 server1 kernel: [517262.504654]  [<ffffffff8118eea4>] __handle_mm_fault+0x1b4/0x360
Jun 23 17:20:10 server1 kernel: [517262.504655]  [<ffffffff8118f101>] handle_mm_fault+0xb1/0x160
Jun 23 17:20:10 server1 kernel: [517262.504658]  [<ffffffff81784667>] ? __do_page_fault+0x2b7/0x5a0
Jun 23 17:20:10 server1 kernel: [517262.504660]  [<ffffffff81784522>] __do_page_fault+0x172/0x5a0
Jun 23 17:20:10 server1 kernel: [517262.504664]  [<ffffffff8111fdec>] ? acct_account_cputime+0x1c/0x20
Jun 23 17:20:10 server1 kernel: [517262.504667]  [<ffffffff810a73a9>] ? account_user_time+0x99/0xb0
Jun 23 17:20:10 server1 kernel: [517262.504669]  [<ffffffff810a79dd>] ? vtime_account_user+0x5d/0x70
Jun 23 17:20:10 server1 kernel: [517262.504671]  [<ffffffff8178498e>] do_page_fault+0x3e/0x80
Jun 23 17:20:10 server1 kernel: [517262.504673]  [<ffffffff817811f8>] page_fault+0x28/0x30
Jun 23 17:20:10 server1 kernel: [517262.504674] Mem-Info:
Jun 23 17:20:10 server1 kernel: [517262.504675] Node 0 DMA per-cpu:
Jun 23 17:20:10 server1 kernel: [517262.504677] CPU    0: hi:    0, btch:   1 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504678] CPU    1: hi:    0, btch:   1 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504679] CPU    2: hi:    0, btch:   1 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504680] CPU    3: hi:    0, btch:   1 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504681] CPU    4: hi:    0, btch:   1 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504682] CPU    5: hi:    0, btch:   1 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504683] CPU    6: hi:    0, btch:   1 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504684] CPU    7: hi:    0, btch:   1 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504685] CPU    8: hi:    0, btch:   1 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504686] CPU    9: hi:    0, btch:   1 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504687] CPU   10: hi:    0, btch:   1 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504687] CPU   11: hi:    0, btch:   1 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504688] CPU   12: hi:    0, btch:   1 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504689] CPU   13: hi:    0, btch:   1 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504690] CPU   14: hi:    0, btch:   1 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504691] CPU   15: hi:    0, btch:   1 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504692] CPU   16: hi:    0, btch:   1 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504693] CPU   17: hi:    0, btch:   1 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504694] CPU   18: hi:    0, btch:   1 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504695] CPU   19: hi:    0, btch:   1 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504696] CPU   20: hi:    0, btch:   1 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504697] CPU   21: hi:    0, btch:   1 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504698] CPU   22: hi:    0, btch:   1 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504698] CPU   23: hi:    0, btch:   1 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504699] Node 0 DMA32 per-cpu:
Jun 23 17:20:10 server1 kernel: [517262.504701] CPU    0: hi:  186, btch:  31 usd:  30
Jun 23 17:20:10 server1 kernel: [517262.504702] CPU    1: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504703] CPU    2: hi:  186, btch:  31 usd:  34
Jun 23 17:20:10 server1 kernel: [517262.504704] CPU    3: hi:  186, btch:  31 usd:  27
Jun 23 17:20:10 server1 kernel: [517262.504705] CPU    4: hi:  186, btch:  31 usd:  30
Jun 23 17:20:10 server1 kernel: [517262.504705] CPU    5: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504706] CPU    6: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504707] CPU    7: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504708] CPU    8: hi:  186, btch:  31 usd: 173
Jun 23 17:20:10 server1 kernel: [517262.504709] CPU    9: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504710] CPU   10: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504711] CPU   11: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504712] CPU   12: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504713] CPU   13: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504714] CPU   14: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504715] CPU   15: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504716] CPU   16: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504717] CPU   17: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504718] CPU   18: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504719] CPU   19: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504720] CPU   20: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504721] CPU   21: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504722] CPU   22: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504722] CPU   23: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504723] Node 0 Normal per-cpu:
Jun 23 17:20:10 server1 kernel: [517262.504724] CPU    0: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504725] CPU    1: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504726] CPU    2: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504727] CPU    3: hi:  186, btch:  31 usd:  14
Jun 23 17:20:10 server1 kernel: [517262.504728] CPU    4: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504729] CPU    5: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504730] CPU    6: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504731] CPU    7: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504732] CPU    8: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504733] CPU    9: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504734] CPU   10: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504735] CPU   11: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504736] CPU   12: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504737] CPU   13: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504738] CPU   14: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504739] CPU   15: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504740] CPU   16: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504740] CPU   17: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504741] CPU   18: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504742] CPU   19: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504743] CPU   20: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504744] CPU   21: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504745] CPU   22: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504746] CPU   23: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504747] Node 1 Normal per-cpu:
Jun 23 17:20:10 server1 kernel: [517262.504748] CPU    0: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504749] CPU    1: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504750] CPU    2: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504751] CPU    3: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504752] CPU    4: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504753] CPU    5: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504754] CPU    6: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504755] CPU    7: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504756] CPU    8: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504757] CPU    9: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504758] CPU   10: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504758] CPU   11: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504759] CPU   12: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504760] CPU   13: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504761] CPU   14: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504762] CPU   15: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504763] CPU   16: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504764] CPU   17: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504765] CPU   18: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504766] CPU   19: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504767] CPU   20: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504768] CPU   21: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504769] CPU   22: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504770] CPU   23: hi:  186, btch:  31 usd:   0
Jun 23 17:20:10 server1 kernel: [517262.504773] active_anon:17833290 inactive_anon:2465707 isolated_anon:0
Jun 23 17:20:10 server1 kernel: [517262.504773]  active_file:573 inactive_file:595 isolated_file:36
Jun 23 17:20:10 server1 kernel: [517262.504773]  unevictable:0 dirty:4 writeback:0 unstable:0
Jun 23 17:20:10 server1 kernel: [517262.504773]  free:82698 slab_reclaimable:43224 slab_unreclaimable:11476749
Jun 23 17:20:10 server1 kernel: [517262.504773]  mapped:2465518 shmem:2465767 pagetables:66385 bounce:0
Jun 23 17:20:10 server1 kernel: [517262.504773]  free_cma:0
Jun 23 17:20:10 server1 kernel: [517262.504776] Node 0 DMA free:14804kB min:8kB low:8kB high:12kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15968kB managed:15828kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jun 23 17:20:10 server1 kernel: [517262.504779] lowmem_reserve[]: 0 2933 64370 64370
Jun 23 17:20:10 server1 kernel: [517262.504782] Node 0 DMA32 free:247776kB min:2048kB low:2560kB high:3072kB active_anon:1774744kB inactive_anon:607052kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3083200kB managed:3003592kB mlocked:0kB dirty:16kB writeback:0kB mapped:607068kB shmem:607068kB slab_reclaimable:25524kB slab_unreclaimable:302060kB kernel_stack:4928kB pagetables:3100kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2660 all_unreclaimable? yes
Jun 23 17:20:10 server1 kernel: [517262.504785] lowmem_reserve[]: 0 0 61436 61436
Jun 23 17:20:10 server1 kernel: [517262.504787] Node 0 Normal free:34728kB min:42952kB low:53688kB high:64428kB active_anon:30286072kB inactive_anon:9255576kB active_file:236kB inactive_file:640kB unevictable:0kB isolated(anon):0kB isolated(file):16kB present:63963136kB managed:62911420kB mlocked:0kB dirty:0kB writeback:0kB mapped:9255000kB shmem:9255724kB slab_reclaimable:86416kB slab_unreclaimable:22165372kB kernel_stack:21072kB pagetables:121112kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:13936 all_unreclaimable? yes
Jun 23 17:20:10 server1 kernel: [517262.504791] lowmem_reserve[]: 0 0 0 0
Jun 23 17:20:10 server1 kernel: [517262.504793] Node 1 Normal free:33484kB min:45096kB low:56368kB high:67644kB active_anon:39272344kB inactive_anon:200kB active_file:2112kB inactive_file:1752kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:67108864kB managed:66056916kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:276kB slab_reclaimable:60956kB slab_unreclaimable:23439564kB kernel_stack:13536kB pagetables:141328kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:18448 all_unreclaimable? yes
Jun 23 17:20:10 server1 kernel: [517262.504797] lowmem_reserve[]: 0 0 0 0
Jun 23 17:20:10 server1 kernel: [517262.504799] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 0*32kB 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 1*2048kB (R) 3*4096kB (M) = 14804kB
Jun 23 17:20:10 server1 kernel: [517262.504807] Node 0 DMA32: 4660*4kB (UEM) 2172*8kB (EM) 1739*16kB (EM) 1046*32kB (UEM) 629*64kB (EM) 344*128kB (UEM) 155*256kB (E) 46*512kB (UE) 3*1024kB (E) 0*2048kB 0*4096kB = 247904kB
Jun 23 17:20:10 server1 kernel: [517262.504816] Node 0 Normal: 9038*4kB (M) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36152kB
Jun 23 17:20:10 server1 kernel: [517262.504822] Node 1 Normal: 9055*4kB (UM) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36220kB
Jun 23 17:20:10 server1 kernel: [517262.504829] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 23 17:20:10 server1 kernel: [517262.504830] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 23 17:20:10 server1 kernel: [517262.504831] 2467056 total pagecache pages
Jun 23 17:20:10 server1 kernel: [517262.504832] 0 pages in swap cache
Jun 23 17:20:10 server1 kernel: [517262.504833] Swap cache stats: add 0, delete 0, find 0/0
Jun 23 17:20:10 server1 kernel: [517262.504834] Free swap  = 0kB
Jun 23 17:20:10 server1 kernel: [517262.504834] Total swap = 0kB
Jun 23 17:20:10 server1 kernel: [517262.504835] 33542792 pages RAM
Jun 23 17:20:10 server1 kernel: [517262.504836] 0 pages HighMem/MovableOnly
Jun 23 17:20:10 server1 kernel: [517262.504837] 262987 pages reserved
Jun 23 17:20:10 server1 kernel: [517262.504838] 0 pages hwpoisoned
Jun 23 17:20:10 server1 kernel: [517262.504839] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Jun 23 17:20:10 server1 kernel: [517262.504866] [  569]     0   569     4997      144      13        0             0 upstart-udev-br
Jun 23 17:20:10 server1 kernel: [517262.504868] [  578]     0   578    12891      187      29        0         -1000 systemd-udevd
Jun 23 17:20:10 server1 kernel: [517262.504873] [  692]   101   692    80659     2295      59        0             0 rsyslogd
Jun 23 17:20:10 server1 kernel: [517262.504875] [  750]     0   750     4084      331      13        0             0 upstart-file-br
Jun 23 17:20:10 server1 kernel: [517262.504877] [  792]     0   792     3815       53      13        0             0 upstart-socket-
Jun 23 17:20:10 server1 kernel: [517262.504879] [  842]   111   842    27001      275      53        0             0 dbus-daemon
Jun 23 17:20:10 server1 kernel: [517262.504880] [  851]     0   851     8834      101      22        0             0 systemd-logind
Jun 23 17:20:10 server1 kernel: [517262.504886] [ 1232]     0  1232     2558      572       8        0             0 dhclient
Jun 23 17:20:10 server1 kernel: [517262.504888] [ 1342]   104  1342    24484      281      49        0             0 ntpd
Jun 23 17:20:10 server1 kernel: [517262.504890] [ 1440]     0  1440     3955       41      12        0             0 getty
Jun 23 17:20:10 server1 kernel: [517262.504891] [ 1443]     0  1443     3955       41      12        0             0 getty
Jun 23 17:20:10 server1 kernel: [517262.504893] [ 1448]     0  1448     3955       39      13        0             0 getty
Jun 23 17:20:10 server1 kernel: [517262.504895] [ 1450]     0  1450     3955       41      13        0             0 getty
Jun 23 17:20:10 server1 kernel: [517262.504896] [ 1452]     0  1452     3955       42      13        0             0 getty
Jun 23 17:20:10 server1 kernel: [517262.504898] [ 1469]     0  1469     4785       40      13        0             0 atd
Jun 23 17:20:10 server1 kernel: [517262.504900] [ 1470]     0  1470    15341      168      32        0         -1000 sshd
Jun 23 17:20:10 server1 kernel: [517262.504902] [ 1472]     0  1472     5914       65      17        0             0 cron
Jun 23 17:20:10 server1 kernel: [517262.504904] [ 1478]   999  1478    16020     3710      31        0             0 gmond
Jun 23 17:20:10 server1 kernel: [517262.504905] [ 1486]     0  1486     4821       65      14        0             0 irqbalance
Jun 23 17:20:10 server1 kernel: [517262.504907] [ 1500]     0  1500   343627     1730      85        0             0 nscd
Jun 23 17:20:10 server1 kernel: [517262.504909] [ 1559]     0  1559     1092       37       8        0             0 acpid
Jun 23 17:20:10 server1 kernel: [517262.504911] [ 1641]     0  1641     4978       71      13        0             0 master
Jun 23 17:20:10 server1 kernel: [517262.504913] [ 1650]   103  1650     5427       72      14        0             0 qmgr
Jun 23 17:20:10 server1 kernel: [517262.504917] [ 1895]     0  1895     1900       30       9        0             0 getty
Jun 23 17:20:10 server1 kernel: [517262.504919] [ 1906]  1000  1906  2854329     2610    2594        0             0 thttpd
Jun 23 17:20:10 server1 kernel: [517262.504927] [ 3163]  1000  3163     2432       39      10        0             0 searchd
Jun 23 17:20:10 server1 kernel: [517262.504928] [ 3167]  1000  3167  2727221  2467025    4863        0             0 sphinx-daemon
Jun 23 17:20:10 server1 kernel: [517262.504931] [47622]  1000 47622 17834794 17329575   33989        0             0 MyExec

<.................Trimmed bunch of processes with low mem usage.......................................>


Jun 23 17:20:10 server1 kernel: [517262.508350] Out of memory: Kill process 47622 (MyExec) score 526 or sacrifice child
Jun 23 17:20:10 server1 kernel: [517262.508375] Killed process 47622 (MyExec) total-vm:71339176kB, anon-rss:69318300kB, file-rss:0kB

Looking at the following lines, it seems like the issue is fragmentation.

Jun 23 17:20:10 server1 kernel: [517262.504816] Node 0 Normal: 9038*4kB (M) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36152kB
Jun 23 17:20:10 server1 kernel: [517262.504822] Node 1 Normal: 9055*4kB (UM) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36220kB

I have no idea why the system would be so badly fragmented. It had only been running for 5 days when this happened. Also, looking at the process that invoked the OOM killer (gmond invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0), it seems it was only requesting a single 4K page (order=0), and there are a bunch of those available.
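The observation about free 4K blocks can be checked mechanically. The following is a minimal Python sketch, not part of the original post, that parses one of the free-list lines quoted above; the helper name `parse_free_list` is made up for illustration.

```python
import re

def parse_free_list(line):
    """Return {block_size_kb: count} from a 'Node N Zone: 9038*4kB ...' line."""
    return {int(size): int(count)
            for count, size in re.findall(r"(\d+)\*(\d+)kB", line)}

# Line copied from the OOM report (Node 0 Normal).
line = ("Node 0 Normal: 9038*4kB (M) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB "
        "0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36152kB")

blocks = parse_free_list(line)
# Free memory accounted for by the list, in kB.
total_kb = sum(size * count for size, count in blocks.items())
# Largest block size that still has at least one free entry.
largest_kb = max((size for size, count in blocks.items() if count), default=0)

print(total_kb)
print(largest_kb)
```

For this zone every free block is a single 4 kB page, so an order=0 request is not blocked by a lack of suitably sized blocks.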

  1. Is my understanding of fragmentation correct in this case?
  2. How can I figure out why the memory got so fragmented?
  3. What can I do to avoid getting into this situation?

One thing you may notice is that I have completely turned off swap and have swappiness set to 0. The reason is that my system has more than enough RAM and should never hit swap. I am planning to enable swap and set swappiness to 10. I am not sure whether that helps in this case.

Thanks for your input.

From the last few lines of the log you can see that the kernel reports a total-vm of 71339176kB (~71 GB, or about 68 GiB) for the killed process. total-vm is the size of the process's virtual address space, not all of which has to be resident. The same line shows resident anonymous memory (anon-rss) of 69318300kB, about 69 GB.
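For scale, the page counters in the Mem-Info block of the report can be converted to GiB. This is a small sketch that assumes a 4 kB page size (consistent with the 4kB order-0 blocks in the log); the selection of counters is an illustration, not a complete accounting.

```python
PAGE_KB = 4  # assumed page size, consistent with the 4kB order-0 blocks in the log

# Page counts copied from the Mem-Info block of the OOM report.
counters = {
    "active_anon": 17833290,
    "inactive_anon": 2465707,
    "slab_unreclaimable": 11476749,
    "slab_reclaimable": 43224,
    "pagetables": 66385,
    "free": 82698,
}

def pages_to_gib(pages, page_kb=PAGE_KB):
    """Convert a page count to GiB."""
    return pages * page_kb / (1024 * 1024)

for name, pages in counters.items():
    print(f"{name:20s} {pages_to_gib(pages):8.2f} GiB")
```

Under that assumption active_anon works out to roughly 68 GiB and slab_unreclaimable to roughly 44 GiB, so the anonymous memory of the killed process is not the only large consumer in the report.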

Is my understanding of fragmentation correct in this case?

If you are capturing system diagnostics (or an sosreport) at the time the issue occurs, check the /proc/buddyinfo file for signs of memory fragmentation. It is best to write a script that saves this information periodically if you are planning to reproduce the problem.
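A script along those lines could look like the following Python sketch. It assumes the usual /proc/buddyinfo layout (one line per zone, one column of free-block counts per allocation order); the log file name `buddyinfo.log` and both function names are hypothetical.

```python
import time

def parse_buddyinfo(text):
    """Map (node, zone) -> list of free block counts, index = allocation order."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: Node 0, zone Normal c0 c1 c2 ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(c) for c in parts[4:]]
    return zones

def snapshot(path="/proc/buddyinfo", log="buddyinfo.log"):
    """Append one timestamped copy of buddyinfo to a log file (hypothetical name)."""
    with open(path) as src:
        text = src.read()
    with open(log, "a") as out:
        out.write(f"--- {time.strftime('%Y-%m-%d %H:%M:%S')}\n{text}")
    return parse_buddyinfo(text)

# Sample input in the buddyinfo layout, reusing the counts from the report.
sample = ("Node 0, zone   Normal   9038      0      0      0      0"
          "      0      0      0      0      0      0\n")
parsed = parse_buddyinfo(sample)
print(parsed[(0, "Normal")][0])
```

Calling `snapshot()` from cron or a simple loop would build up a history that can be compared with the free-list lines in a later OOM report.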

How can I figure out why the memory got so fragmented? What can I do to avoid getting into this situation?

Sometimes applications overcommit memory that the system is unable to honour, potentially leading to OOM. You may want to review the relevant kernel tunables (sysctl -a prints the values currently set) and consider disabling memory overcommit, for example:

vm.overcommit_memory=2
vm.overcommit_ratio=80

Note: After adding the above lines to /etc/sysctl.conf, it is best to restart the system.

vm.overcommit_memory: some applications allocate more virtual memory than is available on the system. The setting takes one of the following values:

0 - a heuristic overcommit algorithm is used.
1 - always overcommit, regardless of whether memory is available or not.
2 - do not overcommit: the kernel lets applications commit at most all of swap plus a percentage of RAM, given by vm.overcommit_ratio (80% in the example above).

Your server is most likely set to 0 or 1.
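As a worked example of the mode 2 limit, the sketch below computes an approximate commit limit from the figures in the OOM report. It assumes 4 kB pages and uses the "pages RAM" line as the RAM size; the kernel's own CommitLimit is based on usable RAM and excludes huge pages, so the real value would be somewhat lower. The function name is made up for illustration.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio, hugetlb_kb=0):
    """Approximate commit limit for vm.overcommit_memory=2, in kB."""
    return (ram_kb - hugetlb_kb) * overcommit_ratio // 100 + swap_kb

ram_kb = 33542792 * 4   # "33542792 pages RAM" from the report, assuming 4 kB pages
swap_kb = 0             # "Total swap = 0kB" from the report
limit = commit_limit_kb(ram_kb, swap_kb, overcommit_ratio=80)
print(limit)
```

With no swap and a ratio of 80, the limit comes to roughly 102 GiB, so about a fifth of RAM could never be committed by applications under these settings.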

Your understanding of fragmentation is incorrect. The OOM killer was invoked because free memory had fallen below the zone watermarks. Take a look at this:

Node 0 Normal free:34728kB min:42952kB low:53688kB
Node 1 Normal free:33484kB min:45096kB low:56368kB
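The comparison can be written out as a short Python sketch. It is a simplification of the kernel's watermark test: it treats a zone as usable only if free memory exceeds min plus the zone's lowmem_reserve entry, assumes 4 kB pages, and ignores the adjustments the kernel applies for some allocation types. The figures are copied from the report; the function name is hypothetical.

```python
PAGE_KB = 4  # assumed page size

def zone_ok(free_kb, min_kb, reserve_pages=0):
    """Simplified watermark test: free must exceed min plus the lowmem reserve."""
    return free_kb > min_kb + reserve_pages * PAGE_KB

# (zone, free kB, min kB, lowmem_reserve entry in pages) taken from the report.
zones = [
    ("Node 0 DMA32", 247776, 2048, 61436),
    ("Node 0 Normal", 34728, 42952, 0),
    ("Node 1 Normal", 33484, 45096, 0),
]
status = {name: zone_ok(free, wmin, reserve) for name, free, wmin, reserve in zones}
for name, ok in status.items():
    print(name, "ok" if ok else "below watermark")
```

Both Normal zones are below min outright. DMA32 passes the plain free/min comparison but not once its lowmem_reserve of 61436 pages is added, which would explain why its 240 MB of free memory did not help.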

Update with slabinfo. This was captured after the node was rebooted.

# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kvm_async_pf           0      0    136   30    1 : tunables    0    0    0 : slabdata      0      0      0
kvm_vcpu               0      0  16256    2    8 : tunables    0    0    0 : slabdata      0      0      0
kvm_mmu_page_header      0      0    168   48    2 : tunables    0    0    0 : slabdata      0      0      0
fusion_ioctx        5005   5005    296   55    4 : tunables    0    0    0 : slabdata     91     91      0
fusion_user_ll_request      0      0   3960    8    8 : tunables    0    0    0 : slabdata      0      0      0
ext4_groupinfo_4k 131670 131670    136   30    1 : tunables    0    0    0 : slabdata   4389   4389      0
ip6_dst_cache       1260   1260    384   42    4 : tunables    0    0    0 : slabdata     30     30      0
UDPLITEv6              0      0   1088   30    8 : tunables    0    0    0 : slabdata      0      0      0
UDPv6                330    330   1088   30    8 : tunables    0    0    0 : slabdata     11     11      0
tw_sock_TCPv6        128    128    256   32    2 : tunables    0    0    0 : slabdata      4      4      0
TCPv6                288    288   1984   16    8 : tunables    0    0    0 : slabdata     18     18      0
kcopyd_job             0      0   3312    9    8 : tunables    0    0    0 : slabdata      0      0      0
dm_uevent              0      0   2632   12    8 : tunables    0    0    0 : slabdata      0      0      0
cfq_queue              0      0    232   35    2 : tunables    0    0    0 : slabdata      0      0      0
bsg_cmd                0      0    312   52    4 : tunables    0    0    0 : slabdata      0      0      0
mqueue_inode_cache     36     36    896   36    8 : tunables    0    0    0 : slabdata      1      1      0
fuse_request           0      0    416   39    4 : tunables    0    0    0 : slabdata      0      0      0
fuse_inode             0      0    768   42    8 : tunables    0    0    0 : slabdata      0      0      0
ecryptfs_key_record_cache      0      0    576   28    4 : tunables    0    0    0 : slabdata      0      0      0
ecryptfs_inode_cache      0      0   1024   32    8 : tunables    0    0    0 : slabdata      0      0      0
fat_inode_cache        0      0    712   46    8 : tunables    0    0    0 : slabdata      0      0      0
fat_cache              0      0     40  102    1 : tunables    0    0    0 : slabdata      0      0      0
hugetlbfs_inode_cache     54     54    600   54    8 : tunables    0    0    0 : slabdata      1      1      0
jbd2_journal_handle   2040   2040     48   85    1 : tunables    0    0    0 : slabdata     24     24      0
jbd2_journal_head   5071   5364    112   36    1 : tunables    0    0    0 : slabdata    149    149      0
jbd2_revoke_table_s   1792   1792     16  256    1 : tunables    0    0    0 : slabdata      7      7      0
jbd2_revoke_record_s   1536   1536     32  128    1 : tunables    0    0    0 : slabdata     12     12      0
ext4_inode_cache   75129  78771    984   33    8 : tunables    0    0    0 : slabdata   2387   2387      0
ext4_free_data      5952   6656     64   64    1 : tunables    0    0    0 : slabdata    104    104      0
ext4_allocation_context    768    768    128   32    1 : tunables    0    0    0 : slabdata     24     24      0
ext4_io_end         1344   1344     72   56    1 : tunables    0    0    0 : slabdata     24     24      0
ext4_extent_status  37921  38352     40  102    1 : tunables    0    0    0 : slabdata    376    376      0
dquot                768    768    256   32    2 : tunables    0    0    0 : slabdata     24     24      0
dnotify_mark         782    782    120   34    1 : tunables    0    0    0 : slabdata     23     23      0
pid_namespace          0      0   2192   14    8 : tunables    0    0    0 : slabdata      0      0      0
posix_timers_cache      0      0    248   33    2 : tunables    0    0    0 : slabdata      0      0      0
UDP-Lite               0      0    896   36    8 : tunables    0    0    0 : slabdata      0      0      0
xfrm_dst_cache         0      0    448   36    4 : tunables    0    0    0 : slabdata      0      0      0
ip_fib_trie          146    146     56   73    1 : tunables    0    0    0 : slabdata      2      2      0
UDP                  828    828    896   36    8 : tunables    0    0    0 : slabdata     23     23      0
tw_sock_TCP          992   1152    256   32    2 : tunables    0    0    0 : slabdata     36     36      0
TCP                  450    450   1792   18    8 : tunables    0    0    0 : slabdata     25     25      0
blkdev_queue         120    136   1896   17    8 : tunables    0    0    0 : slabdata      8      8      0
blkdev_requests     3358   3569    376   43    4 : tunables    0    0    0 : slabdata     83     83      0
blkdev_ioc           964   1287    104   39    1 : tunables    0    0    0 : slabdata     33     33      0
user_namespace         0      0    264   31    2 : tunables    0    0    0 : slabdata      0      0      0
sock_inode_cache    1377   1377    640   51    8 : tunables    0    0    0 : slabdata     27     27      0
net_namespace          0      0   4736    6    8 : tunables    0    0    0 : slabdata      0      0      0
shmem_inode_cache   2112   2112    672   48    8 : tunables    0    0    0 : slabdata     44     44      0
ftrace_event_file   1196   1196     88   46    1 : tunables    0    0    0 : slabdata     26     26      0
taskstats            196    196    328   49    4 : tunables    0    0    0 : slabdata      4      4      0
proc_inode_cache   63037  63250    648   50    8 : tunables    0    0    0 : slabdata   1265   1265      0
sigqueue            1224   1224    160   51    2 : tunables    0    0    0 : slabdata     24     24      0
bdev_cache           819    819    832   39    8 : tunables    0    0    0 : slabdata     21     21      0
kernfs_node_cache  54360  54360    112   36    1 : tunables    0    0    0 : slabdata   1510   1510      0
mnt_cache            510    510    320   51    4 : tunables    0    0    0 : slabdata     10     10      0
inode_cache        16813  19712    584   28    4 : tunables    0    0    0 : slabdata    704    704      0
dentry            144206 144606    192   42    2 : tunables    0    0    0 : slabdata   3443   3443      0
iint_cache             0      0     72   56    1 : tunables    0    0    0 : slabdata      0      0      0
buffer_head       6905641 6922305    104   39    1 : tunables    0    0    0 : slabdata 177495 177495      0
vm_area_struct     16764  16764    184   44    2 : tunables    0    0    0 : slabdata    381    381      0
mm_struct           1008   1008    896   36    8 : tunables    0    0    0 : slabdata     28     28      0
files_cache         1377   1377    640   51    8 : tunables    0    0    0 : slabdata     27     27      0
signal_cache        1380   1380   1088   30    8 : tunables    0    0    0 : slabdata     46     46      0
sighand_cache       1020   1020   2112   15    8 : tunables    0    0    0 : slabdata     68     68      0
task_xstate         1638   1638    832   39    8 : tunables    0    0    0 : slabdata     42     42      0
task_struct          837    855   6480    5    8 : tunables    0    0    0 : slabdata    171    171      0
Acpi-ParseExt       2968   2968     72   56    1 : tunables    0    0    0 : slabdata     53     53      0
Acpi-State           561    561     80   51    1 : tunables    0    0    0 : slabdata     11     11      0
Acpi-Namespace      3162   3162     40  102    1 : tunables    0    0    0 : slabdata     31     31      0
anon_vma           19313  19584     64   64    1 : tunables    0    0    0 : slabdata    306    306      0
shared_policy_node   7735   7735     48   85    1 : tunables    0    0    0 : slabdata     91     91      0
numa_policy          170    170     24  170    1 : tunables    0    0    0 : slabdata      1      1      0
radix_tree_node   2870899 2871624    584   28    4 : tunables    0    0    0 : slabdata 102558 102558      0
idr_layer_cache      555    555   2112   15    8 : tunables    0    0    0 : slabdata     37     37      0
dma-kmalloc-8192       0      0   8192    4    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-4096       0      0   4096    8    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-2048       0      0   2048   16    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-1024       0      0   1024   32    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-512        0      0    512   32    4 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-256        0      0    256   32    2 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-128        0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-64         0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-32         0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-16         0      0     16  256    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-8          0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-192        0      0    192   42    2 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-96         0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-8192         180    180   8192    4    8 : tunables    0    0    0 : slabdata     45     45      0
kmalloc-4096         636    720   4096    8    8 : tunables    0    0    0 : slabdata     90     90      0
kmalloc-2048        6498   6688   2048   16    8 : tunables    0    0    0 : slabdata    418    418      0
kmalloc-1024        4677   4800   1024   32    8 : tunables    0    0    0 : slabdata    150    150      0
kmalloc-512         9029   9056    512   32    4 : tunables    0    0    0 : slabdata    283    283      0
kmalloc-256        31542  31840    256   32    2 : tunables    0    0    0 : slabdata    995    995      0
kmalloc-192        16548  16548    192   42    2 : tunables    0    0    0 : slabdata    394    394      0
kmalloc-128         8449   8544    128   32    1 : tunables    0    0    0 : slabdata    267    267      0
kmalloc-96         20607  21462     96   42    1 : tunables    0    0    0 : slabdata    511    511      0
kmalloc-64         71408  75968     64   64    1 : tunables    0    0    0 : slabdata   1187   1187      0
kmalloc-32          5760   5760     32  128    1 : tunables    0    0    0 : slabdata     45     45      0
kmalloc-16         13824  13824     16  256    1 : tunables    0    0    0 : slabdata     54     54      0
kmalloc-8          45056  45056      8  512    1 : tunables    0    0    0 : slabdata     88     88      0
kmem_cache_node      551    576     64   64    1 : tunables    0    0    0 : slabdata      9      9      0
kmem_cache           256    256    256   32    2 : tunables    0    0    0 : slabdata      8      8      0
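To see which caches hold the most memory in a listing like the one above, the slabdata columns can be multiplied out. This Python sketch assumes the slabinfo 2.x column order shown in the header line and a 4 kB page size; `slab_usage_kb` is a made-up helper name.

```python
def slab_usage_kb(slabinfo_text, page_kb=4):
    """Return {cache_name: kB held}, computed as num_slabs * pagesperslab * page size."""
    usage = {}
    for line in slabinfo_text.splitlines():
        # Skip the header and anything that is not a cache line.
        if line.startswith("#") or "slabdata" not in line:
            continue
        fields = line.split()
        pages_per_slab = int(fields[5])
        num_slabs = int(fields[fields.index("slabdata") + 2])
        usage[fields[0]] = num_slabs * pages_per_slab * page_kb
    return usage

# Three lines copied from the listing above.
sample = """\
dentry            144206 144606    192   42    2 : tunables    0    0    0 : slabdata   3443   3443      0
buffer_head       6905641 6922305    104   39    1 : tunables    0    0    0 : slabdata 177495 177495      0
radix_tree_node   2870899 2871624    584   28    4 : tunables    0    0    0 : slabdata 102558 102558      0
"""
usage = slab_usage_kb(sample)
top = max(usage, key=usage.get)
print(top, usage[top])
```

Since this listing was taken after the reboot, it only shows which caches are already growing; running the same calculation on /proc/slabinfo while unreclaimable slab is high would show where that memory actually sits.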
