
node.js process a big collection of data

I'm working with mongoose in node. I'm making requests to retrieve a collection of items from a remote database. In order to build a full report, I need to traverse the whole collection, which is a large data set.

I am trying to avoid loading everything at once, like this:

model.find({}, function(err, data) {
  // process the bunch of data
})

For now, I use a recursive approach in which I accumulate results in a local variable, and then send back information about the processing as the response.

app.get('/process/it/', (req, res) => {

  var processed_data = [];

  function resolve(procdata) {
    res.json({status: "ok", items: procdata.length});
  }

  function handler(data, procdata, start, n) {
    if (data.length === 0) {
      // no documents left: send the response
      resolve(procdata);
    } else {
      // do something with this batch: push results into procdata
      data.forEach(function (item) { procdata.push(item); });

      // fetch the next page, then recurse
      // (skip/limit must be set before the query runs, so chain them and call exec)
      mongoose.model('model').find({}).skip(start + n).limit(n)
        .exec(function (err, nextData) {
          handler(nextData, procdata, start + n, n);
        });
    }
  }

  var start = 0;
  var mysize = 100;

  // first call
  mongoose.model('model').find({}).skip(start).limit(mysize)
    .exec(function (err, data) {
      handler(data, processed_data, start, mysize);
    });

})

Is there any approach or solution that provides a performance advantage, or simply a better way to achieve this?

Any help would be appreciated.

The solution depends on the use case.

If the data, once processed, doesn't change often, you could maintain a secondary database that holds the processed data.

You can load the unprocessed data from the primary database using pagination, the way you're doing right now, and all of the processed data can be loaded from the secondary database in a single query, as sketched below.
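A minimal sketch of that split might look like this (the 'Item' and 'ProcessedItem' model names and the 'processed' flag are invented for illustration; they are not from the question):

// primary database: unprocessed documents, fetched one page at a time
function loadUnprocessedPage(start, size, callback) {
  mongoose.model('Item').find({ processed: false })   // hypothetical flag
    .skip(start).limit(size)
    .exec(callback);
}

// secondary database/collection: everything already processed, in one query
function loadProcessed(callback) {
  mongoose.model('ProcessedItem').find({}).exec(callback);
}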

This is fine as long as your data set is not too big, though performance may be low. When it reaches the gigabyte level, your application will simply break, because the machine won't have enough memory to hold the data before sending it to the client. Sending gigabytes of report data will also take a lot of time. Here are some suggestions:

  • Try aggregating your data with Mongo's aggregation framework, instead of doing it in your application code (see the sketch after this list)
  • Try to break the report data into smaller reports
  • Pre-generate the report data, store it somewhere (another collection, perhaps), and simply send it to the client when they need to see it
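For the first point, a minimal sketch using Mongoose's aggregate() might look like the following (the 'Order' model and its 'status', 'customerId' and 'amount' fields are invented for illustration; the actual pipeline depends on what your report computes):

app.get('/report/totals/', (req, res) => {
  mongoose.model('Order').aggregate([
    { $match: { status: 'completed' } },       // filter inside MongoDB, not in Node
    { $group: {                                // group and sum on the server
        _id: '$customerId',
        total: { $sum: '$amount' },
        count: { $sum: 1 }
    } },
    { $sort: { total: -1 } }
  ]).exec(function (err, rows) {
    if (err) return res.status(500).json({ status: "error" });
    // only the aggregated rows travel over the wire, not the whole collection
    res.json({ status: "ok", items: rows.length, rows: rows });
  });
});

This keeps the heavy lifting inside the database, so only the aggregated result has to fit in the Node process. If you also want to pre-generate the report (third point), the same pipeline could end with a $out stage that writes the result into another collection.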
