
Deleting large Javascript objects when process is running out of memory

I'm a novice to this kind of javascript, so I'll give a brief explanation:

I have a web scraper built in Node.js that gathers (quite a bit of) data, processes it with Cheerio (basically jQuery for Node), creates an object, then uploads it to MongoDB.

It works just fine, except on larger sites. What appears to be happening is:

  1. I give the scraper an online store's URL to scrape
  2. Node goes to that URL and retrieves anywhere from 5,000 to 40,000 product URLs to scrape
  3. For each of these new URLs, Node's request module gets the page source, then loads the data into Cheerio.
  4. Using Cheerio, I create a JS object which represents the product.
  5. I ship the object off to MongoDB, where it's saved to my database.

As I say, this happens for thousands of URLs, and once I get to, say, 10,000 URLs loaded, I get errors in Node. The most common is:

Node: Fatal JS Error: Process out of memory

Ok, here are the actual questions:

I think this is happening because Node's garbage collection isn't working properly. It's possible, for example, that the request data scraped from all 40,000 URLs is still in memory, or at the very least the 40,000 created javascript objects may be. Perhaps it's also because the MongoDB connection is made at the start of the session and is never closed (I just close the script manually once all the products are done). This is to avoid opening/closing the connection every single time I log a new product.

To really ensure they're cleaned up properly (once a product goes to MongoDB I don't use it anymore and it can be deleted from memory), can/should I simply delete it from memory using delete product?

More so (I'm clearly not across how JS handles objects): if I delete one reference to the object, is it totally wiped from memory, or do I have to delete all of them?

For instance:

var saveToDB = require('./mongoDBFunction.js').saveToDB;

function getData(link){
    request(link, function(err, response, body){
        var $ = cheerio.load(body);
        createProduct($);
    });
}

function createProduct($) {
    var product = {
        a: 'asadf',
        b: 'asdfsd'
        // there's about 50 lines of data in here in the real products but this is for brevity
    };
    product.name = $('.selector').dostuffwithitinjquery('etc');
    saveToDB(product);
}

// In mongoDBFunction.js

exports.saveToDB = function(item){
    db.products.save(item, function(err){
        console.log("Item was successfully saved!");
        delete item; // Will this completely delete the item from memory?
    });
};

delete in javascript is NOT used to delete variables or free memory. It is ONLY used to remove a property from an object. You may find this article on the delete operator a good read.
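A quick illustration of that distinction (the variable and property names here are made up for the demo):

```javascript
// `delete` only removes own properties from objects; it does not free variables.
var product = { name: 'widget', html: '<div>...lots of scraped markup...</div>' };

var removedProp = delete product.html; // true: the property is gone
console.log('html' in product);        // false

var count = 5;
// `delete count` would return false in sloppy mode (and is a SyntaxError in
// strict mode) -- the variable itself cannot be deleted.
```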

You can remove a reference to the data held in a variable by setting the variable to something like null. If there are no other references to that data, then it becomes eligible for garbage collection. If there are other references to that object, then it will not be cleared from memory until there are no more references to it (i.e., no way for your code to get to it).
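For example (a sketch with invented names, matching the scraper scenario):

```javascript
// Once the scraped data has been saved, dropping the last reference makes the
// underlying array eligible for garbage collection.
let scrapedPages = new Array(100000).fill('page source ...');

const pageCount = scrapedPages.length; // ... finish using the data ...

scrapedPages = null; // no references remain -> the array can be collected
```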

As for what is causing the memory accumulation, there are a number of possibilities, and we can't really see enough of your code to know what references could be held onto that would keep the GC from freeing things up.

If this is a single, long-running process with no breaks in execution, you might also need to manually run the garbage collector to make sure it gets a chance to clean up things you have released.

Here are a couple of articles on tracking down memory usage in node.js: http://dtrace.org/blogs/bmc/2012/05/05/debugging-node-js-memory-leaks/ and https://hacks.mozilla.org/2012/11/tracking-down-memory-leaks-in-node-js-a-node-js-holiday-season/ .

JavaScript has a garbage collector that automatically tracks which variables are "reachable". If a variable is "reachable", then its value won't be released.

For example, if you have a global variable var g_hugeArray and you assign it a huge array, you actually have two JavaScript objects here: one is the huge block that holds the array data; the other is a property on the window object named "g_hugeArray" that points to that data. So the reference chain is: window -> g_hugeArray -> the actual array.

In order to release the actual array, you make it "unreachable" by breaking either link in the chain above. If you set g_hugeArray to null, you break the link between g_hugeArray and the actual array; this makes the array data unreachable, so it will be released when the garbage collector runs. Alternatively, you can use "delete window.g_hugeArray" to remove the property "g_hugeArray" from the window object. This breaks the link between window and g_hugeArray and also makes the actual array unreachable.
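The same chain can be demonstrated in Node, where the global object is globalThis rather than window (the names below are invented for the demo):

```javascript
// Build the chain: globalThis -> g_hugeArray -> the actual array.
globalThis.g_hugeArray = new Array(1000000).fill(0);

// Break link 2: the property still exists but no longer points at the array.
globalThis.g_hugeArray = null;

// Rebuild, then break link 1: remove the property itself. `delete` succeeds
// here because the property was assigned, not declared with `var`.
globalThis.g_hugeArray = new Array(1000000).fill(0);
const removed = delete globalThis.g_hugeArray; // true

// Either way, the array data is now unreachable and can be collected.
```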

The situation gets more complicated when you have "closures". A closure is created when you have a local function that references a local variable. For example:

function a()
{
    var x = 10;
    var y = 20;
    setTimeout(function()
        {
            alert(x);
        }, 100);
}

In this case, the local variable x is still reachable from the anonymous timeout function even after function "a" has returned. Without the timeout function, both local variables x and y would become unreachable as soon as function a returns, but the existence of the anonymous function changes this. Depending on how the JavaScript engine is implemented, it may choose to keep both x and y alive (because it doesn't know whether the function will need y until the function actually runs, which happens after function a returns), or, if it is smart enough, it may keep only x. Imagine that if both x and y point to big things, this can be a problem. So closures are very convenient, but at times they are more likely to cause memory issues and can make those issues harder to track down.

I faced the same problem in my application, which has similar functionality. I've been looking for memory leaks or something like that. The memory consumed by my process reached 1.4 GB, and it depends on the number of links that must be downloaded.

The first thing I noticed was that after manually running the garbage collector, almost all memory was freed. Each page that I downloaded took about 1 MB, was processed, and was stored in the database.

Then I installed heapdump and looked at a snapshot of the application. You can find more information about memory profiling on the Webstorm Blog.


My guess is that while the application is running, the GC does not start. To deal with this, I began running the application with the --expose-gc flag and running the GC manually while the program executes.

const runGCIfNeeded = (() => {
    let i = 0;
    return function runGCIfNeeded() {
        if (i++ > 200) {
            i = 0;

            if (global.gc) {
                global.gc();
            } else {
                logger.warn('Garbage collection unavailable. Pass --expose-gc when launching node to enable forced garbage collection.');
            }
        }
    };
})();

// run GC check after each iteration
checkProduct(product._id)
    .then(/* ... */)
    .finally(runGCIfNeeded)

Interestingly, if you do not use const, let, var, etc. when you define something in the global scope, it becomes a property of the global object, and delete returns true. This could allow it to be garbage collected. I tested it like this, and it seems to have the intended impact on my memory usage; please let me know if this is incorrect or if you got drastically different results:


x = [];
process.memoryUsage();
i = 0;
while (i < 1000000) {
    x.push(10.5);
    i++;
}
process.memoryUsage();
delete x;
process.memoryUsage();

