
How to release hugepages from a crashed application

I have an application that uses hugepages, and it suddenly crashed because of a bug. Because the application did not release its hugepages properly, the free hugepage count in sysfs did not go back up after the crash.

$ sudo cat /sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages 
0
$ sudo cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages 
1024

Is there a way to release the hugepages by force?

Sometimes you need to check every directory where hugetlbfs is mounted:

  1. Find the mount points with the command mount | grep huge.

  2. Check every one of those directories, not only /dev/hugepages.

  3. Delete all 2M-sized files (2M is the hugepage size).
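
The steps above could be scripted roughly as follows. This is only a sketch: the function names hugetlbfs_mounts and list_hugepage_files are made up for illustration, and it reads /proc/mounts directly instead of parsing mount output. It lists files but deliberately deletes nothing, so you can confirm that no running process still needs them first.

```shell
# List hugetlbfs mount points from a mounts table in /proc/mounts format
# (fields: device mountpoint fstype options ...). The optional file
# argument exists only so the function can be tried on sample data.
hugetlbfs_mounts() {
    awk '$3 == "hugetlbfs" { print $2 }' "${1:-/proc/mounts}"
}

# Show the files in each hugetlbfs mount point without deleting anything.
list_hugepage_files() {
    hugetlbfs_mounts "$@" | while IFS= read -r dir; do
        echo "== $dir"
        ls -l "$dir"
    done
}
```

Calling list_hugepage_files with no argument inspects the live system (reading the directories may require sudo). Files that turn out to be leftovers can then be removed with rm.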

Use ipcs -m to list the shared memory segments. Use ipcrm to remove the leftover shared memory segments.
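
As a rough illustration, the candidates can be narrowed down by looking at the nattch column of ipcs -m. The helper below is hypothetical (the name unattached_shmids is not a standard tool) and assumes the util-linux column layout: key, shmid, owner, perms, bytes, nattch, status. A segment with no attached processes is only a candidate, not proof that it is orphaned.

```shell
# Read `ipcs -m` output on stdin and print the shmid of every segment
# whose nattch column (field 6) is 0. Header and separator lines are
# skipped because their first field does not look like a hex key.
unattached_shmids() {
    awk '$1 ~ /^0x/ && $6 == 0 { print $2 }'
}
```

A typical use would be ipcs -m | unattached_shmids, followed by ipcrm -m with a shmid once you are sure that segment is no longer needed.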

Edit on 06/24/2019: OK, so the above answer, while correct as far as it goes, was a bit brief. In particular, if you have a host with multiple DB instances and only one has crashed, how can you determine which (if any) memory segments should be cleaned up?

Well, this too can be done. For each running instance, connect with / as sysdba, then run oradebug setmypid (any PID will do, as all Oracle PIDs connect to the SGA). Then run oradebug ipc. That will (hopefully) return "IPC information written to the trace file". So go to the udump (or diag_dest) directory and look for your trace file. It will contain all the IPC information for the instance, including ShmId. Look through the file for the ShmId(s) that this instance is using. Now look at the output of ipcs -m.

When you have done that for all the running instances, any memory segment listed by ipcs -m that shows a non-zero memory allocation, and that you cannot account for in the oradebug ipc information from any running instance, must be a leftover memory segment from the crashed instance. Use ipcrm to remove it/them.
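
The comparison described above is a set difference, which could be sketched like this. It assumes you have already collected the ShmId values from the trace files of all running instances into a plain text file, one id per line. That file and the function name unaccounted_shmids are illustrative assumptions, not something oradebug produces for you.

```shell
# Read `ipcs -m` output on stdin and print every shmid that is NOT listed
# in the file given as $1 (one known in-use shmid per line).
unaccounted_shmids() {
    awk -v known_file="$1" '
        BEGIN { while ((getline id < known_file) > 0) known[id] = 1 }
        $1 ~ /^0x/ && !($2 in known) { print $2 }
    '
}
```

For example, ipcs -m | unaccounted_shmids known_ids.txt prints the segments that no running instance claimed. Review that list by hand before passing anything to ipcrm.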

When doing this on a host with multiple running instances, this can be a bit fraught. Please proceed with caution. You don't want to remove the SGA of a running instance!

Hope that helps....

HugeTLB can be used either for shared memory (Mark J. Bobak's answer deals with that) or through files that the app mmaps in a hugetlb filesystem. If the app crashes without removing those files, they survive and keep the corresponding memory 'allocated'.

Check the hugeTLB filesystem and see if there are any leftover files from the app. Removing them releases the memory.

If you follow the instructions below, you can release the allocated hugepages:

1) Let's check the hugepages that are free after a restart

dpdk@dpdkvm:~$ ls /mnt/huge/
empty

dpdk@dpdkvm:~/dpdk-1.8.0/examples/kni$ cat /proc/meminfo
...
HugePages_Total:     256
HugePages_Free:      256
...

2) Start a dpdk application with wrong parameters, producing an error

dpdk@dpdkvm:~/dpdk-1.8.0/examples/kni$ sudo ./build/kni -c 0x03 -n 2 -- -P -p 0x03 --config="(0,0,1),(1,0,1)"
...
EAL: Error - exiting with code: 1
  Cause: No supported Ethernet device found

3) When I check the hugepages, none are free

dpdk@dpdkvm:~/dpdk-1.8.0/examples/kni$ cat /proc/meminfo
...
HugePages_Total:     256
HugePages_Free:        0
...

4) Now, when I check the mounted hugepage directory, I can see the files that the dpdk application did not give back to the OS.

dpdk@dpdkvm:~/dpdk-1.8.0/examples/kni$ ls /mnt/huge/
...
rtemap_0    rtemap_137  rtemap_176  rtemap_214  rtemap_253  rtemap_62
...

5) Finally, if you remove the files starting with rtemap, you give the hugepages back

dpdk@dpdkvm:~/dpdk-1.8.0/examples/kni$ sudo rm /mnt/huge/*
[sudo] password for dpdk:
dpdk@dpdkvm:~/dpdk-1.8.0/examples/kni$ cat /proc/meminfo
...
HugePages_Total:     256
HugePages_Free:      256
...
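
To compare the counters before and after each step without reading the whole of /proc/meminfo, a small helper along these lines may be handy. The name hugepage_usage is made up for this sketch, and the optional file argument is only there so it can be tried on saved output.

```shell
# Print "total free used" hugepage counts from a meminfo-format file
# (defaults to /proc/meminfo). "used" is simply total minus free.
hugepage_usage() {
    awk '/^HugePages_Total:/ { total = $2 }
         /^HugePages_Free:/  { free = $2 }
         END { print total, free, total - free }' "${1:-/proc/meminfo}"
}
```

In the session above, it would report a used count of 256 after step 3 and 0 again after the rtemap files are removed.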

Your hugetlb pages may be in use by shared memory or mmapped files. Try removing the shared memory segments or unmounting the hugetlb fs.

