
Prevent concurrent processing in NodeJS

I need NodeJS to prevent concurrent processing of identical requests. From what I understand, if NodeJS receives multiple requests, this is what happens:

REQUEST1 ---> DATABASE_READ
REQUEST2 ---> DATABASE_READ
DATABASE_READ complete ---> EXPENSIVE_OP() --> REQUEST1_END
DATABASE_READ complete ---> EXPENSIVE_OP() --> REQUEST2_END

This results in two expensive operations running. What I need is something like this:

REQUEST1 ---> DATABASE_READ
DATABASE_READ complete ---> DATABASE_UPDATE
DATABASE_UPDATE complete ---> REQUEST2 ---> DATABASE_READ --> REQUEST2_END
                         ---> EXPENSIVE_OP() --> REQUEST1_END

This is what it looks like in code. The problem is the window between when the app starts reading the cache value and when it finishes writing to it. During this window, concurrent requests don't know that a request with the same itemID is already running.

app.post("/api", async function(req, res) {
    const itemID = req.body.itemID

    // See if itemID is processing
    const processing = await DATABASE_READ(itemID)
    // Due to how NodeJS works, 
    // from this point in time all requests
    // to /api?itemID="xxx" will have processing = false 
    // and will conduct expensive operations

    if (processing == true) {
        // "Cheap" part
        // Tell client to wait until itemID is processed
    } else {
        // "Expensive" part
        DATABASE_UPDATE({[itemID]: true})
        // All requests to /api at this point
        // are still going here and conducting 
        // duplicate operations.
        // Only after DATABASE_UPDATE finishes, 
        // all requests go to the "Cheap" part
        DO_EXPENSIVE_THINGS();
    }
})

Edit

Of course I can do something like this:

const lockedIDs = {}
app.post("/api", function(req, res) {
    const itemID = req.body.itemID
    const locked = lockedIDs[itemID] ? true : false // sync equivalent to async DATABASE_READ(itemID)
    if (locked) {
        // Tell client to wait until itemID is processed
        // No need to do expensive operations
    } else {
        lockedIDs[itemID] = true // sync equivalent to async DATABASE_UPDATE({[itemID]: true})
        // Do expensive operations
        // itemID is now "locked", so subsequent request will not go here
    }
})

lockedIDs here behaves like an in-memory synchronous key-value database. That is fine if there is just one server. But what if there are multiple server instances? Then I need separate cache storage, like Redis, which I can only access asynchronously. So this will not work, unfortunately.
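Sketched with a promise-based Redis client, the async version has exactly the same window:

app.post("/api", async function(req, res) {
    const itemID = req.body.itemID
    const locked = await redis.get(itemID) // request B can run its GET here...
    if (locked) {
        // Tell client to wait until itemID is processed
    } else {
        await redis.set(itemID, "true")    // ...so A and B both reach this line
        // Do expensive operations (duplicated!)
    }
})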

You could create a local Map object (in memory for synchronous access) that contains each itemID being processed as a key. You could make the value for that key be a promise that resolves with whatever result came from anyone who previously processed that key. I think of this like a gatekeeper. It keeps track of which itemIDs are being processed.

This scheme tells future requests for the same itemID to wait and does not block other requests - I thought that was important rather than just using a global lock on all requests related to itemID processing.

Then, as part of your processing, you first check the local Map object. If that key is in there, then it's currently being processed. You can then just await the promise from the Map object to see when it's done being processed and get any result from prior processing.

If it's not in the Map object, then it's not being processed right now and you can immediately put it in the Map to mark it as "in process". If you set a promise as the value, then you can resolve that promise with whatever result comes from this processing of the object.

Any other requests that come along will end up just waiting on that promise and you will thus only process this ID once. The first one to start with that ID will process it and all other requests that come along while it's processing will use the same shared result (thus saving the duplication of your heavy computation).

I initially did not understand what your pseudo-code was trying to do well enough to offer a concrete code example.

Systems like this have to have perfect error handling so that all possible error paths handle the Map and promise embedded in the Map properly.

Based on your fairly light pseudo-code example, here's a similar pseudo-code example that illustrates the above concept:

const itemInProcessCache = new Map();

app.get("/api", async function(req, res) {
    const itemID = req.query.itemID;
    let gate = itemInProcessCache.get(itemID);
    if (gate) {
        gate.then(val => {
            // use cached result here from previous processing
            res.send(val);
        }).catch(err => {
            // decide what to do when previous processing had an error
            res.sendStatus(500);
        });
    } else {
        let p = DATABASE_UPDATE({[itemID]: true}).then(result => {
            // expensive processing done
            // return the final value so any others waiting on the gate can use it
            // decide if you want to clear this item from itemInProcessCache or not
            return result;
        }).catch(err => {
            // error on expensive processing

            // remove from the gate cache because we didn't get a result;
            // expensive processing will have to be done by someone else
            itemInProcessCache.delete(itemID);
            throw err;
        });
        // mark this item as being processed
        itemInProcessCache.set(itemID, p);
        // respond to this first request with the result too
        p.then(val => res.send(val)).catch(err => res.sendStatus(500));
    }
});

Note: This relies on the single-threadedness of node.js. No other request can run until the synchronous part of this request handler finishes, so itemInProcessCache.set(itemID, p) is guaranteed to happen before any other request for this itemID gets started.
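Here's a small, self-contained sketch of that guarantee (illustrative only, not part of your app):

// The synchronous prefix of an async function runs to completion before the
// event loop can run anything else, so a check-then-set with no await in
// between cannot be interleaved with another request.
const seen = new Set();

async function handler(id) {
    if (seen.has(id)) return "duplicate";
    seen.add(id);                               // no await between check and set
    await new Promise(r => setTimeout(r, 100)); // other "requests" may run here
    return "processed";
}

Promise.all([handler("a"), handler("a")]).then(console.log);
// => [ 'processed', 'duplicate' ]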


Also, I don't know databases very well, but this seems very much like a feature that a good multi-user database might have built in, or have supporting features that make it easier, since it's not uncommon to want to keep multiple requests from all trying to do the same database work (or, worse yet, trouncing each other's work).

Ok, let me take a crack at this.

So, the difficulty I'm having with this question is that you've abstracted the problem so much that it's really hard to help you optimize. It's not clear what your "long running process" is doing, and what it does affects how to solve the challenge of handling multiple concurrent requests. What is your API doing that makes you worried about consuming resources?

From your code, at first I guessed that you're kicking off some kind of long-running job (e.g. a file conversion), but some of the edits and comments make me think it might just be a complex database query that requires a lot of calculation to get right, so you want to cache the query results. But I could also see it being something else, like a query against a bunch of third-party APIs that you're aggregating. Each scenario has nuances that change what's optimal.

That said, I'll explain the 'cache' scenario and you can tell me if you're more interested in one of the other solutions.

Basically, you're in the right ballpark for the cache already. If you haven't already, I'd recommend looking at cache-manager, which simplifies your boilerplate a little for these scenarios (and lets you set cache invalidation and even have multi-tier caching). The piece you're missing is that you essentially should always respond with whatever you have in the cache, and populate the cache outside the scope of any given request. Using your code as a starting point, something like this (leaving off all the try/catches and error checking for simplicity):

// A GET is OK here, because no matter what we're firing back a response quickly, 
//      and semantically this is a query
app.get("/api", async function(req, res) {
    const itemID = req.query.itemID

    // In this case, I'm assuming you have a cache object that basically gets whatever
    //    is cached in your cache storage and can set new things there too.  
    let item = await cache.get(itemID)

    // Item isn't in the cache at all, so this is the very first attempt.  
    if (!item) {
        // go ahead and let the client know we'll get to it later. 202 Accepted should 
        //   be fine, but pick your own status code to let them know it's in process. 
        //   Other good options include 503 Service Unavailable with a Retry-After 
        //   header and 420 Enhance Your Calm (non-standard, but funny)
        res.status(202).send({ id: itemID });

        // put an empty object in there so we know it's working on it. 
        await cache.set(itemID, {}); 

        // start the long-running process, which should update the cache when it's done
        await populateCache(itemID); 
        return;
    }
    // Here we have an item in the cache, but it's not done processing.  Maybe you 
    //     could just check to see if it's an empty object or not, but I'm assuming 
    //     that we've setup a boolean flag on the cached object for when it's done.
    if (!item.processed) {
        // The client should try again later like above.  Exit early. You could 
        //    alternatively send the partial item, an empty object, or a message. 
        return res.status(202).send({ id: itemID });
    } 

    // if we get here, the item is in the cache and done processing. 
    return res.send(item);
})
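As an aside, the cache object above is just a stand-in. A minimal sketch of one way to create it, assuming cache-manager's v4 API (later major versions changed the interface):

const cacheManager = require("cache-manager");

// An in-memory store works for a single instance; swap in a Redis-backed
// store (e.g. cache-manager-redis-store) when you have multiple servers.
const cache = cacheManager.caching({
    store: "memory",
    max: 1000, // max number of cached items
    ttl: 600   // seconds
});

// In v4, cache.get(key) resolves with the value or undefined, and
// cache.set(key, value) stores it; both return promises.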

Now, I don't know precisely what all your stuff does, but if it were me, populateCache from above would be a pretty simple function that just calls whatever service does the long-running work and then puts the result into the cache.

async function populateCache(itemId) {
   const item = await service.createThisWorkOfArt(itemId);
   await cache.set(itemId, item); 
   return; 
}

Let me know if that's not clear or if your scenario is really different from what I'm guessing.

As mentioned in the comments, this approach will cover most normal issues you might have with your described scenario, but it will still allow two requests to both fire off the long-running process if they come in faster than the write to your cache store (e.g. Redis). I'd judge the odds of that happening to be pretty low, but if you're really concerned about it, the next, more paranoid version of this would be to remove the long-running process code from your web API altogether. Instead, your API would just record that someone requested the work to happen, and if there's nothing in the cache, respond as I did above, but completely remove the block that actually calls populateCache.
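If you did want to close that window at the cache layer, Redis can do the check-and-set as one atomic command with the NX flag. Here's a minimal sketch, assuming the node-redis v4 client (the lock:<itemID> key name and 60-second expiry are just examples):

import { createClient } from "redis";

const redis = createClient();
await redis.connect();

// SET key value NX EX executes as a single atomic command on the Redis
// server, so there is no read-then-write window between app instances.
async function tryLock(itemID) {
    const result = await redis.set(`lock:${itemID}`, "1", { NX: true, EX: 60 });
    return result === "OK"; // null means another instance holds the lock
}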

Instead, you would have a separate worker process running that periodically (how often depends on your business case) checks the cache for unprocessed jobs and kicks off the work to process them. By doing it this way, even if you have thousands of concurrent requests for the same item, you can ensure that it's only processed once. The downside, of course, is that the periodicity of the check adds to the delay in getting the fully processed data.
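Sketched out, that worker might look something like this (cache.unprocessedIds is a hypothetical helper for however your cache store lists items that were requested but not yet processed; service.createThisWorkOfArt is from the earlier example):

// Hypothetical worker process, reusing the cache and service objects above.
const CHECK_INTERVAL_MS = 5000; // tune to your business case

setInterval(async () => {
    const pending = await cache.unprocessedIds(); // e.g. the empty objects written by the API
    for (const itemID of pending) {
        const item = await service.createThisWorkOfArt(itemID);
        await cache.set(itemID, { ...item, processed: true });
    }
}, CHECK_INTERVAL_MS);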
