
Ejabberd is using all available CPU, how to debug

I have problems with my ejabberd installation and I am struggling to figure out what is going on.

After 15-20 minutes my CPU usage spikes to 100%, for no apparent reason that I can find, and from then on it stays flat out at full CPU. I have tried upgrading the server hardware, but it still cannot handle the load. The server is quite modern: a KVM-virtualized Xeon processor with 8 cores and 32GB RAM, and no other workloads.

I have tried to run etop but that does not work:

root@collaboration:/# ./usr/lib/erlang/lib/observer-2.9.4/priv/bin/etop -node ejabberd@localhost
Erlang/OTP 23 [erts-11.0.3] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1]

Eshell V11.0.3 (abort with ^G)
(etop@collaboration)1> {"init terminating in do_boot",{{badmatch,{error,nxdomain}},[{etop_tr,reader,1,[{file,"etop_tr.erl"},{line,62}]},{etop,init_data_handler,1,[{file,"etop.erl"},{line,146}]},{etop,start,1,[{file,"etop.erl"},{line,129}]},{init,start_em,1,[]},{init,do_boot,3,[]}]}}
init terminating in do_boot ({{badmatch,{error,nxdomain}},[{etop_tr,reader,1,[{ },{ }]},{etop,init_data_handler,1,[{ },{ }]},{etop,start,1,[{ },{ }]},{init,start_em,1,[]},{init,do_boot,3,[]}]})

Crash dump is being written to: erl_crash.dump...done

My error log has many entries with strange content. I suspect that my database is not in a healthy state. The DB is 10 years old and has been through many upgrades, so there is a high probability of problems. Downloadable error.log here: https://fil.email/u1U0Y1wu

Pastebin extracts from error.log: https://pastebin.com/umpf51aU

Recently I upgraded to ejabberd 20.07, and I have tried to apply all the MySQL schema updates etc. This cannot have worked as well as I hoped, because there are traces of problems in the logs. At least this one fails: https://docs.ejabberd.im/admin/upgrade/from_19.05_to_19.08/

root@:~# mysql -u ejabberd ejabberd -p << EOF
ALTER TABLE users MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE last MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE rosterusers MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE rostergroups MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE sr_group MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE sr_user MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE spool MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE archive MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE archive_prefs MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE vcard MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE vcard_search MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE privacy_default_list MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE privacy_list MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE private_storage MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE roster_version MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE muc_room MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE muc_registered MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE muc_online_room MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE muc_online_users MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE motd MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE sm MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE route MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE push_session MODIFY server_host varchar(191) NOT NULL;
ALTER TABLE mix_pam MODIFY server_host varchar(191) NOT NULL;
EOF
Enter password:
ERROR 1054 (42S22) at line 1: Unknown column 'server_host' in 'users'

Since I am a little lost as to why we are having all these CPU issues, I am contemplating dropping the database and importing a backup on a freshly installed server. How would I go about exporting as much healthy data as possible and importing it into a new database? Preferably I would export users with passwords and rosters as a minimum. There are no MUC rooms or similar. If possible, the SSL certs (ACME) should be migrated as well, as Let's Encrypt is not too happy about new certs being requested all the time. If you have any guidance on this issue I would be very happy!

Just an FYI: with the above log and load I have 155 users online and 12500 registered users.

From your logs:

exception exit: {undef,
                    [{xmpp_stream_out,stop_async,[<0.4108.0>],[]},

Here Erlang reports an undef error: the function xmpp_stream_out:stop_async/1 was called, but it is not defined in the code that is loaded.

Looking at the sources, that function was defined in xmpp 1.4.6: https://github.com/processone/xmpp/commit/c23e66ebac8fdec4aa08c8926091b0dcf6dacf22

And its usage was added to ejabberd in ejabberd 20.04: https://github.com/processone/ejabberd/commit/1bd560f3f25d0a644bac3d06904ca97e20a6f7d9

So at first sight it looks as if you are running ejabberd 20.04 or newer, but with a version of the xmpp library older than 1.4.6.
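One way to check this on a Debian-based system is to compare the upstream part of the installed package version with 1.4.6. This is only a rough sketch: it assumes the library comes from a Debian package named erlang-p1-xmpp and that `sort -V` is available, and the helpers `upstream_of` and `at_least` are made-up names, not part of ejabberd or dpkg.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of ejabberd or dpkg.

# Strip the Debian epoch ("1:") and the revision ("-0.1~afa100") from a version.
upstream_of() {
    v=${1#*:}                  # drop everything up to the first colon, if any
    printf '%s\n' "${v%-*}"    # drop the trailing "-revision", if any
}

# Succeed when upstream version $1 is greater than or equal to $2.
at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Query the installed version only if dpkg-query exists and the package is installed.
if command -v dpkg-query >/dev/null 2>&1; then
    installed=$(dpkg-query -W -f='${Version}' erlang-p1-xmpp 2>/dev/null) || installed=
    if [ -n "$installed" ]; then
        if at_least "$(upstream_of "$installed")" 1.4.6; then
            echo "erlang-p1-xmpp $installed: upstream version is 1.4.6 or newer"
        else
            echo "erlang-p1-xmpp $installed: upstream version is older than 1.4.6"
        fi
    fi
fi
```

The epoch is stripped on purpose, because only the upstream version says whether stop_async is present in the library.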

Based on @Badlop's response, the problem was solved by installing a newer erlang-p1-xmpp. For some reason apt had a dependency problem: it thought the installed package was newer than the one in the unstable repository.

root@collaboration:~/download# dpkg -i erlang-p1-xmpp_1.4.9-1_amd64.deb
dpkg: warning: downgrading erlang-p1-xmpp from 1:1.2.8-0.1~afa100 to 1.4.9-1
(Reading database... 105425 files and directories currently installed.)
Preparing to unpack erlang-p1-xmpp_1.4.9-1_amd64.deb...
Unpacking erlang-p1-xmpp (1.4.9-1) over (1:1.2.8-0.1~afa100)...
Setting up erlang-p1-xmpp (1.4.9-1)...
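The "downgrading" warning hints at the likely reason apt refused: the old package version starts with an epoch (`1:`), and Debian version comparison looks at the epoch before anything else, so `1:1.2.8-0.1~afa100` ranks above `1.4.9-1` even though its upstream version is older. A small sketch to confirm this; `epoch_of` is a hypothetical helper, and the `dpkg --compare-versions` call is skipped where dpkg is not installed.

```shell
#!/bin/sh
# Hypothetical helper: print the epoch of a Debian version string (0 when absent).
epoch_of() {
    case $1 in
        *:*) printf '%s\n' "${1%%:*}" ;;   # text before the first colon
        *)   printf '0\n' ;;               # no colon means epoch 0
    esac
}

old='1:1.2.8-0.1~afa100'   # version that was installed
new='1.4.9-1'              # version from the unstable repository

echo "epoch of $old: $(epoch_of "$old")"
echo "epoch of $new: $(epoch_of "$new")"

# Ask dpkg itself which version it considers greater, when dpkg is available.
if command -v dpkg >/dev/null 2>&1; then
    if dpkg --compare-versions "$old" gt "$new"; then
        echo "dpkg ranks $old above $new"
    fi
fi
```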

I will see if I can file a bug report against the Debian repository to get this issue fixed.
