简体   繁体   中英

How do You monitor sidekiq processes?

I'm working on a production app that has multiple rails servers behind nginx loadbalancer. We are monitoring sidekiq processes with monit, and it works just fine - when sidekiq proces dies monit starts it right back.

However recently encountered a situation where one of these processes was running and visible to monit, but for some reason not visible to sidekiq. That resulted in many failed jobs and took us some time to notice that we're missing one process in sidekiq Web UI, since monit was telling us everything was fine and all processes were running. Simple restart fixed the problem.

And that bring me to my question: how do you monitor your sidekiq processes? I know i can use something like rollbar to notify me when jobs fail, but i'd like to know if there is a way to monitor process count and preferably send mail when one dies. Any suggestions?

Something that would ping sidekiq/stats and verify response.

My approach is the following:

  1. create a background job that does something
  2. call the job regularly
  3. check that the thing is being done!

so; using a cron script (or something like whenever ) every 5 mins, I run :

CheckinJob.perform_later

It's now up to sidekiq (or delayed_job, or whatever active job you're using) to actually run the job.

The job just has to do something which you can check.

I used to get the job to update a record in my Status table (essentially a list of key/value records). Then I'd have a /status page which returns a :500 status code if the record hasn't been updated in the last 6 minutes.

(obviously your timing may vary)

Then I use a monitoring service to monitor the status page! (something like StatusCake )

Nowdays I have a simpler approach; I just get the background job to check in with a cron monitoring service like

The monitoring service which expects your task to check in every X mins. If your task doesn't check in - then the monitoring service will let you know.

Integration is dead simple for all the services. For Is It Working it would be:

IsItWorkingInfo::Checkin.ping(key:"CHECKIN_IDENTIFIER")

full disclosure: I wrote IsItWorking !

My super simple solution to a similar problem looks like this:

# sidekiq_check.rb
namespace :sidekiq_check do
  task rerun: :environment do
    if Sidekiq::ProcessSet.new.size == 0
      exec 'bundle exec sidekiq -d -L log/sidekiq.log -C config/sidekiq.yml -e production'
    end
  end
end

and then using cron/ whenever

# schedule.rb
every 5.minutes do 
  rake 'sidekiq_check:rerun'
end

We ran into this problem where our sidekiq processes had stopped working off jobs overnight and we had no idea. It took us about 30 minutes to integrate http://deadmanssnitch.com by following these instructions .

It's not the prettiest or cheapest option but it gets the job done (integrates nicely with Pagerduty) and has saved our butt twice in the last few months.

On of our complaints with the service is the shortest grace interval we can set is 15 minutes which is too long for us. So we're evaluating similar services like Healthchecks, etc.

I use god gem to monitor my sidekiq processes. God gem makes sure that your process is always running and also can notify the process status on various channels.

ROOT = File.dirname(File.dirname(__FILE__))
God.pid_file_directory = File.join(ROOT, "tmp/pids")

God.watch do |w|
  w.env = {'RAILS_ENV' => ENV['RAILS_ENV'] || 'development'}
  w.name = 'sidekiq'
  w.start = "bundle exec sidekiq -d -L log/sidekiq.log -C config/sidekiq.yml -e #{ENV['RAILS_ENV']}"
  w.log = "#{ROOT}/log/sidekiq_god.log"
  w.behavior(:clean_pid_file)
  w.dir = ROOT
  w.keepalive

  w.restart_if do |restart|
    restart.condition(:memory_usage) do |c|
      c.interval = 120.seconds
      c.above = 100.megabytes
      c.times = [3, 5] # 3 out of 5 intervals
    end

    restart.condition(:cpu_usage) do |c|
      c.interval = 120.seconds
      c.above = 80.percent
      c.times = 5
    end
  end

  w.lifecycle do |on|
    on.condition(:flapping) do |c|
      c.to_state = [:start, :restart]
      c.times = 5
      c.within = 5.minute
      c.transition = :unmonitored
      c.retry_in = 10.minutes
      c.retry_times = 5
      c.retry_within = 1.hours
    end
  end
end

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM