
Prevent concurrent processing in NodeJS

I need NodeJS to prevent concurrent operations for the same requests. From what I understand, if NodeJS receives multiple requests, this is what happens:

REQUEST1 ---> DATABASE_READ
REQUEST2 ---> DATABASE_READ
DATABASE_READ complete ---> EXPENSIVE_OP() --> REQUEST1_END
DATABASE_READ complete ---> EXPENSIVE_OP() --> REQUEST2_END

This results in two expensive operations running. What I need is something like this:

REQUEST1 ---> DATABASE_READ
DATABASE_READ complete ---> DATABASE_UPDATE
DATABASE_UPDATE complete ---> REQUEST2 ---> DATABASE_READ ---> REQUEST2_END
                         ---> EXPENSIVE_OP() --> REQUEST1_END

This is what it looks like in code. The problem is the window between when the app starts reading the cache value and when it finishes writing to it. During this window, concurrent requests don't know that there is already a request with the same itemID running.

app.post("/api", async function(req, res) {
    const itemID = req.body.itemID

    // See if itemID is processing
    const processing = await DATABASE_READ(itemID)
    // Due to how NodeJS works, 
    // from this point in time all requests
    // to /api?itemID="xxx" will have processing = false 
    // and will conduct expensive operations

    if (processing == true) {
        // "Cheap" part
        // Tell client to wait until itemID is processed
    } else {
        // "Expensive" part
        DATABASE_UPDATE({[itemID]: true})
        // All requests to /api at this point
        // are still going here and conducting 
        // duplicate operations.
        // Only after DATABASE_UPDATE finishes, 
        // all requests go to the "Cheap" part
        DO_EXPENSIVE_THINGS();
    }
})

Edit

Of course I can do something like this:

const lockedIDs = {}
app.post("/api", function(req, res) {
    const itemID = req.body.itemID
    const locked = lockedIDs[itemID] ? true : false // sync equivalent to async DATABASE_READ(itemID)
    if (locked) {
        // Tell client to wait until itemID is processed
        // No need to do expensive operations
    } else {
        lockedIDs[itemID] = true // sync equivalent to async DATABASE_UPDATE({[itemID]: true})
        // Do expensive operations
        // itemID is now "locked", so subsequent request will not go here
    }
})

lockedIDs here behaves like an in-memory synchronous key-value database. That is fine if it is just one server. But what if there are multiple server instances? Then I need separate cache storage, like Redis. And I can access Redis only asynchronously. So this will not work, unfortunately.
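
To make that race concrete, here is a self-contained sketch in which a plain in-memory Map stands in for Redis, wrapped so every read and write is asynchronous. The names `asyncGet`, `asyncSet`, and the itemID `"xxx"` are made up for illustration. Two requests that arrive before either write lands both pass the check:

```javascript
// fakeRedis stands in for Redis: reads and writes are asynchronous,
// so there is a window between reading the lock and writing it.
const fakeRedis = new Map();
const asyncGet = key => Promise.resolve(fakeRedis.get(key));
const asyncSet = (key, val) => Promise.resolve(fakeRedis.set(key, val));

let expensiveRuns = 0; // counts how many times the "expensive" part runs

async function handleRequest(itemID) {
  const locked = await asyncGet(itemID); // the window opens here...
  if (locked) return "wait";
  await asyncSet(itemID, true);          // ...and only closes here
  expensiveRuns++;                       // the "expensive" part
  return "processed";
}

// Both requests read the lock before either write completes.
const demo = Promise.all([handleRequest("xxx"), handleRequest("xxx")])
  .then(() => expensiveRuns);

demo.then(n => console.log(n)); // 2 -- both requests ran the expensive part
```

This is exactly the window described above: the `await` between the read and the write lets the second request's read complete while the first request's write is still in flight.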

You could create a local Map object (in memory, for synchronous access) that contains any itemID that is being processed as a key. You could make the value for that key be a promise that resolves with the result from whoever previously processed that key. I think of this like a gatekeeper. It keeps track of which itemIDs are being processed.

This scheme tells future requests for the same itemID to wait, and does not block other requests - I thought that was important, rather than just using a global lock on all requests related to itemID processing.

Then, as part of your processing, you first check the local Map object. If that key is in there, then it's currently being processed. You can then just await the promise from the Map object to see when it's done being processed and get the result from the prior processing.

If it's not in the Map object, then it's not being processed now, and you can immediately put it in the Map to mark it as "in process". If you set a promise as the value, then you can resolve that promise with whatever result you get from this processing of the object.

Any other requests that come along will just wait on that promise, and you will thus only process this ID once. The first request to arrive with that ID will process it, and all other requests that come along while it's processing will use the same shared result (thus avoiding the duplication of your heavy computation).

I tried to code up an example, but did not really understand what your pseudo-code was trying to do well enough to offer one.

Systems like this have to have perfect error handling so that all possible error paths handle the Map, and the promise embedded in the Map, properly.

Based on your fairly light pseudo-code example, here's a similar pseudo-code example that illustrates the above concept:

const itemInProcessCache = new Map();

app.get("/api", async function(req, res) {
    const itemID = req.query.itemID
    let gate = itemInProcessCache.get(itemID);
    if (gate) {
        gate.then(val => {
            // use cached result here from previous processing
        }).catch(err => {
            // decide what to do when previous processing had an error
        });
    } else {
    } else {
        let p = DATABASE_UPDATE({[itemID]: true}).then(result => {
            // expensive processing done
            // return final value so any others waiting on the gate can just use that value
            // decide if you want to clear this item from itemInProcessCache or not
        }).catch(err => {
            // error on expensive processing

            // remove from the gate cache because we didn't get a result
            // expensive processing will have to be done by someone else
            itemInProcessCache.delete(itemID);
        });
        // mark this item as being processed
        itemInProcessCache.set(itemID, p);
    }
});

Note: This relies on the single-threadedness of node.js. No other request can get started until the request handler here returns, so itemInProcessCache.set(itemID, p); gets called before any other request for this itemID could get started.
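
For completeness, here is a runnable, self-contained version of that gate idea. `EXPENSIVE_OP` is a stand-in that just sleeps; in real code it would be your database work:

```javascript
const itemInProcessCache = new Map();

let expensiveRuns = 0; // counts how many times the expensive work actually runs

async function EXPENSIVE_OP(itemID) {
  expensiveRuns++;
  await new Promise(resolve => setTimeout(resolve, 50)); // simulate slow work
  return `result-for-${itemID}`;
}

function processItem(itemID) {
  // If a promise is already in the Map, this itemID is being processed:
  // return the shared promise so the caller awaits the same result.
  const gate = itemInProcessCache.get(itemID);
  if (gate) return gate;

  // Otherwise start the work and store the promise synchronously, before
  // any await, so no other request can slip past the gate.
  const p = EXPENSIVE_OP(itemID).finally(() => {
    itemInProcessCache.delete(itemID); // allow future reprocessing
  });
  itemInProcessCache.set(itemID, p);
  return p;
}

// Three "concurrent requests" for the same item share one computation.
const demo = Promise.all([
  processItem("xxx"), processItem("xxx"), processItem("xxx")
]);

demo.then(results => {
  console.log(results[0]);    // result-for-xxx
  console.log(expensiveRuns); // 1 -- the expensive work ran only once
});
```

The key difference from the question's version is that the Map is written synchronously, in the same tick as the check, so there is no window for a second request to sneak through.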


Also, I don't know databases very well, but this seems very much like a feature that a good multi-user database might have built in, or have supporting features for that make it easier, since it's not an uncommon requirement to not want multiple requests all trying to do the same database work (or, worse yet, trouncing each other's work).

Ok, let me take a crack at this.

So, the problem I'm having with this question is that you've abstracted the problem so much that it's really hard to help you optimize. It's not clear what your "long running process" is doing, and what it is doing will affect how to solve the challenge of handling multiple concurrent requests. What is your API doing that makes you worried about consuming resources?

From your code, at first I guessed that you're kicking off some kind of long-running job (e.g. file conversion or something), but then some of the edits and comments make me think that it might be just a complex query against the database which requires a lot of calculations to get right, and so you want to cache the query results. But I could also see it being something else, like a query against a bunch of third-party APIs that you're aggregating. Each scenario has some nuance that changes what's optimal.

That said, I'll explain the 'cache' scenario, and you can tell me if you're more interested in one of the other solutions.

Basically, you're in the right ballpark for the cache already. If you haven't already, I'd recommend looking at cache-manager, which simplifies your boilerplate a little for these scenarios (and lets you set cache invalidation and even have multi-tier caching). The piece that you're missing is that you essentially should always respond with whatever you have in the cache, and populate the cache outside the scope of any given request. Using your code as a starting point, something like this (leaving off all the try..catches and error checking and such for simplicity):

// A GET is OK here, because no matter what we're firing back a response quickly, 
//      and semantically this is a query
app.get("/api", async function(req, res) {
    const itemID = req.query.itemID

    // In this case, I'm assuming you have a cache object that basically gets whatever
    //    is cached in your cache storage and can set new things there too.  
    let item = await cache.get(itemID)

    // Item isn't in the cache at all, so this is the very first attempt.  
    if (!item) {
        // go ahead and let the client know we'll get to it later. 202 Accepted should 
        //   be fine, but pick your own status code to let them know it's in process. 
        //   Other good options include 503 Service Unavailable with a
        //   Retry-After header, and 420 Enhance Your Calm (non-standard, but funny)
        res.status(202).send({ id: itemID });

        // put an empty object in there so we know it's working on it. 
        await cache.set(itemID, {}); 

        // start the long-running process, which should update the cache when it's done
        await populateCache(itemID); 
        return;
    }
    // Here we have an item in the cache, but it's not done processing.  Maybe you 
    //     could just check to see if it's an empty object or not, but I'm assuming 
    //     that we've setup a boolean flag on the cached object for when it's done.
    if (!item.processed) {
        // The client should try again later like above.  Exit early. You could 
        //    alternatively send the partial item, an empty object, or a message. 
       return res.status(202).send({ id: itemID });
    } 

    // if we get here, the item is in the cache and done processing. 
    return res.send(item);
})

Now, I don't know precisely what all your stuff does, but if it's me, populateCache from above is a pretty simple function that just calls whatever service we're using to do the long-running work and then puts it into the cache.

async function populateCache(itemId) {
   const item = await service.createThisWorkOfArt(itemId);
   await cache.set(itemId, item); 
   return; 
}

Let me know if that's not clear or if your scenario is really different from what I'm guessing.

As mentioned in the comments, this approach will cover most normal issues you might have with your described scenario, but it will still allow two requests to both fire off the long-running process if they come in faster than the write to your cache store (e.g. Redis). I judge the odds of that happening to be pretty low, but if you're really concerned about it, then the next, more paranoid version of this would be to simply remove the long-running process code from your web API altogether. Instead, your API just records that someone requested that stuff to happen and, if there's nothing in the cache, responds as I did above, but completely removes the block that actually calls populateCache.

Instead, you would have a separate worker process running that would periodically (how often depends on your business case) check the cache for unprocessed jobs and kick off the work of processing them. By doing it this way, even if you have 1000s of concurrent requests for the same item, you can ensure that you only process it one time. The downside, of course, is that you add the periodicity of the check to the delay in getting the fully processed data.
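
That worker variant can be sketched roughly like this. Everything here is in-memory and the names (`pending`, `workerTick`, `doExpensiveWork`) are illustrative only; a real setup would use Redis for the two stores and run the worker on a timer or as a separate process:

```javascript
const pending = new Set();   // itemIDs waiting for processing
const results = new Map();   // itemID -> processed result

// The API never does the expensive work: it serves the cache if it can,
// otherwise it records the request and answers 202.
function apiRequest(itemID) {
  if (results.has(itemID)) return { status: 200, body: results.get(itemID) };
  pending.add(itemID);       // adding the same ID twice is a no-op
  return { status: 202, body: { id: itemID } };
}

let runs = 0;
async function doExpensiveWork(itemID) {
  runs++;                    // stand-in for the long-running job
  return `processed-${itemID}`;
}

// One "tick" of the worker: drain the pending set, one job per itemID.
async function workerTick() {
  for (const itemID of [...pending]) {
    pending.delete(itemID);
    results.set(itemID, await doExpensiveWork(itemID));
  }
}

// 1000 concurrent requests for the same item enqueue it exactly once.
for (let i = 0; i < 1000; i++) apiRequest("xxx");
const demo = workerTick().then(() => {
  console.log(runs);                     // 1
  console.log(apiRequest("xxx").status); // 200
});
```

Because the Set deduplicates itemIDs, any number of requests between worker ticks still produces a single job, which is what buys the "process exactly once" guarantee at the cost of the tick interval's latency.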
