
Implementing a Timer in Python

General Overview

  • I have a medium-size Django project
  • I have a bunch of prefix trees in memory (as opposed to in the DB)
  • The nodes of these trees represent entities/objects that are subject to a timeout, i.e. I need to time out these nodes at various points in time

Design:

  • Essentially, I needed a Timer construct that lets me fire a resettable one-shot timer and give it a callback that can perform some operation on the entity creating the timer, which in this case is a node of the tree.

After looking through the various options, I couldn't find anything that I could use natively (like some Django app). The Timer object in Python's threading module is not suitable for this, since it spawns one thread per timer and won't scale/perform. Thus I decided to write my own timer based on:

  1. A sorted list of time-delta objects that holds the time-horizon
  2. A mechanism to trigger the "tick"

Implementation Choices:

  1. Went with a wrapper around Bisect for the sorted delta list: http://code.activestate.com/recipes/577197-sortedcollection/
  2. Went with Celery to provide the tick, at a granularity of 1 minute: the worker triggers the timer_tick function provided by my Timer class. timer_tick walks the sorted list, decrementing the head node on every tick; any nodes that have ticked down to 0 get their callback fired and are removed from the sorted timer list.
  3. Creating a timer involves instantiating a Timer object, which returns the id of the object. This id is stored in the DB and associated with the entry that represents the entity creating the timer.

Additional Data Structures: In order to track the Timer instances (one gets instantiated per timer creation), I have a weakref dictionary that maps each id to its object.

So essentially, I have two data structures in the memory of my main Django process.

Problem Statement:

Since the Celery worker needs to walk the timer list and also potentially modify the id2obj map, it looks like I need to find a way to share state between my Celery worker and the main process.

Going through SO/Google, I find the following suggestions

  1. Manager
  2. Shared Memory

Unfortunately, the bisect wrapper doesn't lend itself very well to pickling and/or state sharing. I tried the Manager approach by creating a dict and embedding the sorted list within it, but it failed with an error (kind of expected, I guess, since the memory held by the sorted list is not shared, and embedding it within a "shared" memory object doesn't make it so).

Finally...Question:

  1. Is there a way I can share my SortedCollection and weakref dict with the worker process?

Alternate solution:

How about keeping the worker thread simple: have it write to the DB on every tick, then use a post-save DB signal to get notified in the main process and process the expired timers there. Of course, the con is that I lose parallelization.

Let's start with some comments on your existing implementation:

Went with a wrapper around Bisect for the sorted delta list: http://code.activestate.com/recipes/577197-sortedcollection/

While this gives you O(1) pops (as long as you keep the list in reverse time order), it makes each insert O(N) (and likewise for less common operations like deleting arbitrary jobs if you have a "cancel" API). Since you're doing exactly as many inserts as pops, this means the whole thing is algorithmically no better than an unsorted list.

Replacing this with a heapq (that's exactly what they're for) gives you O(log N) inserts. (Note that Python's heapq doesn't have a peek , but that's because heap[0] is the peek, so you don't need one.)

If you need to make other operations (cancel, iterate non-destructively, etc.) O(log N) as well, you want a search tree; look at blist and bintrees on PyPI for some good ones.


Went with celery to provide the tick - A granularity of 1 minute, where the worker would trigger the timer_tick function provided by my Timer class. The timer_tick essentially should go through the sorted list, decrementing the head node every tick. Then if any nodes have ticked down to 0, kick off the callback and remove those nodes from the sorted timer list.

It's much nicer to just keep the target times instead of the deltas. With target times, you just have to do this:

while q.peek().timestamp <= now():
    process(q.pop())

Again, that's O(1) rather than O(N), and it's a lot simpler, and it treats the elements on the queue as immutable, and it avoids any possible problems with iterations taking longer than your tick time (probably not a problem with 1-minute ticks…).
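For example, a minimal sketch of that loop with heapq and absolute timestamps (the schedule / fire_due_timers names and the job-id payload are just illustrative):

```python
import heapq
import time

# The heap stores (absolute_target_time, job_id) pairs, smallest first.
timer_heap = []

def schedule(delay_seconds, job_id):
    # O(log N): push the absolute expiry time, not a delta.
    heapq.heappush(timer_heap, (time.time() + delay_seconds, job_id))

def fire_due_timers(process):
    # Called once per tick: pops everything whose target time has passed.
    now = time.time()
    while timer_heap and timer_heap[0][0] <= now:
        _, job_id = heapq.heappop(timer_heap)
        process(job_id)
```

Nothing in the queue gets mutated on a tick; entries that aren't due yet are never even looked at beyond the head.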


Now, on to your main question:

Is there a way I can share my SortedCollection

Yes. If you just want a priority heap of (timestamp, id) pairs, you can fit that into a multiprocessing.Array just as easily as a list , except for the need to keep track of length explicitly. Then you just need to synchronize every operation, and… that's it.

If you're only ticking once a minute, and you expect to be busy more often than not, you can just use a Lock to synchronize, and have the scheduler worker do the tick itself.
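As a sketch of that Array-plus-Lock idea (capacity, layout, and all names here are assumptions; job ids are stored as floats to fit a 'd' array, and under the "spawn" start method you'd pass these objects to workers explicitly rather than rely on globals):

```python
from multiprocessing import Array, Lock, Value

CAPACITY = 1024  # assumed fixed capacity

# Flat layout: slot i occupies arr[2*i] (timestamp) and arr[2*i+1] (job id).
heap_arr = Array('d', 2 * CAPACITY, lock=False)
heap_len = Value('i', 0, lock=False)   # explicit length tracking
heap_lock = Lock()

def _swap(i, j):
    heap_arr[2*i], heap_arr[2*j] = heap_arr[2*j], heap_arr[2*i]
    heap_arr[2*i+1], heap_arr[2*j+1] = heap_arr[2*j+1], heap_arr[2*i+1]

def shared_push(timestamp, job_id):
    # O(log N) insert: append at the end, then sift up.
    with heap_lock:
        n = heap_len.value
        if n >= CAPACITY:
            raise RuntimeError("shared heap is full")
        heap_arr[2*n], heap_arr[2*n+1] = timestamp, float(job_id)
        i = n
        while i > 0 and heap_arr[2*i] < heap_arr[2*((i-1)//2)]:
            _swap(i, (i-1)//2)
            i = (i-1)//2
        heap_len.value = n + 1

def shared_pop():
    # Pop the (timestamp, job_id) pair with the smallest timestamp.
    with heap_lock:
        n = heap_len.value
        if n == 0:
            return None
        top = (heap_arr[0], int(heap_arr[1]))
        n -= 1
        heap_arr[0], heap_arr[1] = heap_arr[2*n], heap_arr[2*n+1]
        heap_len.value = n
        i = 0
        while True:  # sift down
            left, right, smallest = 2*i + 1, 2*i + 2, i
            if left < n and heap_arr[2*left] < heap_arr[2*smallest]:
                smallest = left
            if right < n and heap_arr[2*right] < heap_arr[2*smallest]:
                smallest = right
            if smallest == i:
                break
            _swap(i, smallest)
            i = smallest
        return top
```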

But honestly, I'd drop the ticks completely and just use a Condition —it's more flexible, and conceptually simpler (even if it's a bit more code), and it means you're using 0% CPU when there's no work to be done and responding quickly and smoothly when you're under load. For example:

def schedule_job(timestamp, job):
    job_id = add_job_to_shared_dict(job) # see below
    with scheduler_condition:
        scheduler_heap.push((timestamp, job_id))
        scheduler_condition.notify_all()

def scheduler_worker_run_once():
    with scheduler_condition:
        while True:
            top = scheduler_heap.peek()
            if top is not None:
                delay = top[0] - now()
                if delay <= 0:
                    break
                scheduler_condition.wait(delay)
            else:
                scheduler_condition.wait()
        top = scheduler_heap.pop()
        if top is not None:
            job = pop_job_from_shared_dict(top[1])
            process_job(job)

Anyway, that brings us to the weakdict full of jobs.

Since a weakdict is explicitly storing references to in-process objects, it doesn't make any sense to share it across processes. What you want to store are immutable objects that define what the jobs actually are, not the mutable jobs themselves. Then it's just a plain old dict.

But still, a plain old dict is not an easy thing to share across processes.

The easy way to do that is to use a dbm database (or a shelve wrapper around one) instead of an in-memory dict , synchronized with a Lock . But this means re-flushing and re-opening the database every time anyone wants to change it, which may be unacceptable.
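A sketch of that shelve-plus-Lock variant (the path and function names are made up; the Lock only helps if the worker processes actually inherit it, e.g. via fork):

```python
import shelve
from multiprocessing import Lock

DB_PATH = "jobs.db"  # hypothetical path
db_lock = Lock()

def put_job(job_id, job_spec):
    # job_spec should be a picklable, immutable description of the job.
    with db_lock:
        with shelve.open(DB_PATH) as db:
            db[str(job_id)] = job_spec  # shelve keys must be strings

def pop_job(job_id):
    with db_lock:
        with shelve.open(DB_PATH) as db:
            return db.pop(str(job_id), None)
```

Every call re-opens (and on close, flushes) the database, which is exactly the cost mentioned above.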

Switching to, say, a sqlite3 database may seem like overkill, but it may be a whole lot simpler.
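For example, a minimal sqlite3 version might look like this (table layout and names are assumptions; job specs are stored as serialized text, and each process should open its own connection to the same file):

```python
import sqlite3

# One table mapping autoincrement integer ids to serialized job specs.
conn = sqlite3.connect("jobs.sqlite")
conn.execute(
    "CREATE TABLE IF NOT EXISTS jobs (id INTEGER PRIMARY KEY, spec TEXT)"
)

def add_job(spec_json):
    with conn:  # implicit transaction handles cross-process locking
        cur = conn.execute("INSERT INTO jobs (spec) VALUES (?)", (spec_json,))
        return cur.lastrowid

def pop_job(job_id):
    with conn:
        row = conn.execute(
            "SELECT spec FROM jobs WHERE id = ?", (job_id,)
        ).fetchone()
        if row is None:
            return None
        conn.execute("DELETE FROM jobs WHERE id = ?", (job_id,))
        return row[0]
```

You get id generation, persistence across restarts, and locking for free, which is why it may end up simpler than it sounds.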

On the other hand… the only operations you actually have here are "map the next id to this job and return the id" and "pop and return the job specified by this id". Does that really need to be a dict? The keys are integers, and you control them. An Array , plus a single Value for the next key, and a Lock , and you're almost done. The problem is that you need some kind of scheme for key overflow. Instead of just next_id += 1 , you have to roll over, and check for already-used slots:

with lock:
    for _ in range(size):
        next_id += 1
        if next_id == size: next_id = 0
        if arr[next_id] is None:
            arr[next_id] = job
            return next_id
    raise RuntimeError('no free job slots')

Another option is to just store the dict in the main process, and use a Queue to make other processes query it.
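Here's a rough sketch of that queue-served dict (the protocol and names are invented for illustration; with several workers you'd want one reply queue per worker so replies don't interleave):

```python
from multiprocessing import Process, Queue

def dict_server(request_q, reply_q):
    # Owns the only copy of the job dict; handles one request at a time,
    # so no extra locking is needed.
    jobs = {}
    while True:
        op, key, value = request_q.get()
        if op == "stop":
            break
        elif op == "put":
            jobs[key] = value
            reply_q.put(None)  # ack
        elif op == "pop":
            reply_q.put(jobs.pop(key, None))

def start_dict_server():
    request_q, reply_q = Queue(), Queue()
    server = Process(target=dict_server, args=(request_q, reply_q), daemon=True)
    server.start()
    return request_q, reply_q, server
```

Since dict_server only uses .get() and .put(), the same loop also works with queue.Queue and a thread if you'd rather keep the dict in the main process itself.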
