I have a complex system that involves many Resque workers, jobs, and a monitoring process. The jobs have parent-child dependencies and run through a series of states (using state-machine), which is the reason for the monitoring process. We depend on the database state to keep cross-process tracking in sync.
Here's a rough idea:
class ParentMonitor < ActiveRecord::Base
  # Assumption: child_monitors has an owner_id column (ChildMonitor's owner association).
  has_many :children, class_name: 'ChildMonitor', foreign_key: :owner_id

  state_machine :state, initial: :work_needed do
    event :succeed do
      transition :work_needed => :work_succeeded
    end
    event :fail do
      transition :work_needed => :work_failed
    end
  end

  # Called by each child after it finishes; does nothing until every child is done.
  def child_transition
    return if children.any? { |child| child.work_needed? }
    if children.any? { |child| child.work_succeeded? }
      succeed
    else
      fail
    end
  end
end
class ChildMonitor < ActiveRecord::Base
  belongs_to :owner, class_name: 'ParentMonitor'

  state_machine :state, initial: :work_needed do
    event :succeed do
      transition :work_needed => :work_succeeded
    end
    after_transition :to => :work_succeeded, :do => :notify_owner

    event :fail do
      transition :work_needed => :work_failed
    end
    after_transition :to => :work_failed, :do => :notify_owner
  end

  # Tell the parent that this child has reached a final state.
  def notify_owner
    owner.child_transition
  end
end
What's happening is that for the first few such jobs (say a dozen or two out of several hundred), the ParentMonitors are being left in the work_needed state even though all children are in either work_succeeded or work_failed. Through tracing and testing I've determined that each time ParentMonitor#child_transition is called, the list of children in the work_needed state shrinks, until at some point it makes a database load and replaces all the children with values of work_needed, even though some had previously been completed.
In addition, I don't see any UPDATE statements in the log file for these first few children until it suddenly starts logging the updates. That logging coincides with the moment it seems to reset the states of all its children.
It makes me think that the changes are all happening in memory due to some cached state, but I've added reload, save and find calls throughout and they don't seem to effect change. I've also tried wrapping these calls in uncache but that doesn't help.
As it turned out, this was caused by the writes being held in a long-running transaction: the state-machine gem holds a transaction open from the state change until the end of any after hooks. We had written hooks that ran for hours on the main monitoring loop.
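To make the mechanism concrete, here is a toy model in plain Ruby. It is not ActiveRecord and not the state-machine gem; ToyStore and everything in it are made-up names, and the "transaction" is just a private buffer merged on commit. It only illustrates why a second process keeps seeing work_needed while a slow after hook is still running inside the open transaction.

```ruby
# Toy model (hypothetical, not ActiveRecord): writes stay private to the
# writer until its transaction commits.
class ToyStore
  def initialize
    @committed = {} # what every other process can read
  end

  # Reads from outside a transaction only see committed data.
  def read(key)
    @committed[key]
  end

  # Yields a private write buffer; it is merged only when the block returns.
  def transaction
    pending = {}
    yield pending
    @committed.merge!(pending) # COMMIT
  end
end

store = ToyStore.new
store.transaction { |tx| tx[:child_1] = :work_needed }

seen_by_monitor = nil
store.transaction do |tx|
  tx[:child_1] = :work_succeeded # the state change
  # Stand-in for an after hook that runs for hours: the transaction is
  # still open, so any other reader still gets the old value.
  seen_by_monitor = store.read(:child_1)
end

puts seen_by_monitor.inspect      # value read while the hook was running
puts store.read(:child_1).inspect # value read after the commit
```

In the real system the "other reader" was a different process (the monitor or another worker), which is why reload and find did not help: the rows they fetched had not been committed yet.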
We resolved this by performing the actions between state changes rather than in callbacks.
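A rough sketch of that restructuring, again in plain Ruby with invented names (SketchJob, long_running_work) rather than our actual models. The transaction method here only tracks whether one is open, so the log shows where the slow work ran. The idea is that the transition itself stays short and the slow work happens with no transaction open.

```ruby
# Hypothetical stand-in for a state-machine-backed record. `transaction`
# does not talk to a database; it only records whether one is "open".
class SketchJob
  attr_reader :state, :log

  def initialize
    @state = :work_needed
    @in_transaction = false
    @log = []
  end

  def transaction
    @in_transaction = true
    yield
  ensure
    @in_transaction = false
  end

  # Short transition: nothing slow happens inside the transaction.
  def succeed
    transaction { @state = :work_succeeded }
  end

  # Placeholder for the hours-long work that used to live in an after hook.
  def long_running_work
    @log << (@in_transaction ? :inside_transaction : :outside_transaction)
  end
end

job = SketchJob.new
job.long_running_work # slow work first, with no transaction open
job.succeed           # then record the outcome in a brief transaction

puts job.log.inspect
puts job.state.inspect
```

With this ordering each state change commits almost immediately, so other processes see it as soon as it happens.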
Incidentally, the erroneous behavior is exactly as described in the latest Red Book as a side effect of the "weak isolation" concurrency implemented in most RDBMSs:
example anomalies include reading intermediate data that another transaction produced, reading aborted data, reading two or more different values for the same item during execution of the same transaction, and “losing” some effects of transactions due to concurrent writes to the same item