Filtering a large JSON dataset based on another dataset

Question

I have a large JSON dataset, Dataset A (180,000 records), containing users' complete records, and another JSON dataset, Dataset B (a subset of A, about 1,500 records), containing only the unique ID and name of some users. I need to get the complete records from Dataset A for the users in Dataset B.

Here is what I've tried so far:

let detailedSponsoreApplicants = [];
let j;
for (j = 0; j < allApplicants.length; j++) {
    let a = allApplicants[j];

    let i;
    for (i = 0; i < sponsoredApplicants.length; i++) {
        let s = sponsoredApplicants[i];
        if (s && s.number === a.applicationNumber) {
            detailedSponsoreApplicants.push(a);
        } else {
            if (s) {
                logger.warn(`${s.number} not found in master list`);
            }
        }
    }
}

The problem with the above code is that at some point I get the error FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory

So, how do I accomplish this task efficiently without running into that error?

EDIT - SAMPLE JSON

Dataset A
{
  "applicationNumber": "3434343",
  "firstName": "dcds",
  "otherNames": "sdcs",
  "surname": "sdcs",
  "phone": "dscd",
  .
  .
  .
  "stateOfOrigin": "dcsd"
}

Dataset B
{
    "number": "3434343",
    "fullName": "dcds sdcs sdcs"
}

Answer 1

Try giving Node more memory to work with:

node --max-old-space-size=1024 index.js #increase to 1gb
node --max-old-space-size=2048 index.js #increase to 2gb
node --max-old-space-size=3072 index.js #increase to 3gb
node --max-old-space-size=4096 index.js #increase to 4gb
node --max-old-space-size=5120 index.js #increase to 5gb
node --max-old-space-size=6144 index.js #increase to 6gb
node --max-old-space-size=7168 index.js #increase to 7gb
node --max-old-space-size=8192 index.js #increase to 8gb
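To check that the flag actually took effect, one option is to print the heap limit that V8 reports at startup. This is a minimal sketch using Node's built-in v8 module; the helper name heapLimitMb is made up for illustration, and the reported figure may differ somewhat from the --max-old-space-size value because it covers more than the old space.

```javascript
const v8 = require('v8');

// Hypothetical helper: returns V8's current heap size limit in megabytes.
function heapLimitMb() {
  const { heap_size_limit } = v8.getHeapStatistics();
  return Math.round(heap_size_limit / (1024 * 1024));
}

console.log(`Heap limit: ${heapLimitMb()} MB`);
```

Running the script with and without the flag and comparing the printed values shows whether the setting is being picked up.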

Also, your script may take a long time to run. To improve performance, consider using a Map or converting your large array into an object for fast lookups:

const obj = a.reduce((obj, current) => {
  obj[current.applicationNumber] = current;
  return obj;
}, {});

You can then look up the full details in constant time:

const fullDetailsOfFirstObject = obj[B[0].number];
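Putting the lookup idea together for the whole of Dataset B might look like the sketch below. It uses a Map instead of a plain object; the helper name pickSponsored and the sample records are invented for illustration, and only the field names applicationNumber and number come from the question.

```javascript
// Sample records shaped like the question's datasets (values are made up).
const allApplicants = [
  { applicationNumber: '1001', firstName: 'Ada', surname: 'Obi' },
  { applicationNumber: '1002', firstName: 'Ben', surname: 'Eze' },
  { applicationNumber: '1003', firstName: 'Chi', surname: 'Ude' },
];
const sponsoredApplicants = [
  { number: '1003', fullName: 'Chi Ude' },
  { number: '1001', fullName: 'Ada Obi' },
  { number: '9999', fullName: 'Not In A' },
];

// Hypothetical helper: index Dataset A once, then look up each entry of B.
function pickSponsored(all, sponsored) {
  const byNumber = new Map(all.map((a) => [a.applicationNumber, a]));
  const found = [];
  const missing = [];
  for (const s of sponsored) {
    if (!s) continue; // skip empty entries, as the original loop did
    const match = byNumber.get(s.number);
    if (match) {
      found.push(match);
    } else {
      missing.push(s.number); // report each missing number once
    }
  }
  return { found, missing };
}

const { found, missing } = pickSponsored(allApplicants, sponsoredApplicants);
console.log(found.length, missing);
```

With the sizes from the question, this does roughly 180,000 insertions plus 1,500 lookups instead of 270 million comparisons, and a "not found" warning would be produced at most once per entry in Dataset B rather than once per non-matching pair.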

Answer 2

Maybe not the most efficient approach, but one that will work is:

1) Import Dataset A (the huge one) into a database, for example SQLite or any database you are familiar with.

2) Add an index on the applicationNumber field.

3) Query the database for each element in Dataset B, or try querying in bulk (selecting more than one at a time).

I've done this before for a similar use case and it worked, but in your case there might be better ways of doing it.
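For the bulk querying mentioned in step 3, one way to prepare the queries is to split the IDs from Dataset B into batches and build a parameterized SELECT for each batch. The sketch below only builds the SQL text and parameter lists and leaves execution to whichever database driver is used; the helper name buildBatchedQueries and the table name applicants are assumptions, not something the answer specifies.

```javascript
// Hypothetical helper: split the IDs from Dataset B into batches and build one
// parameterized SELECT per batch. The table name "applicants" is an assumption.
function buildBatchedQueries(numbers, batchSize) {
  const queries = [];
  for (let i = 0; i < numbers.length; i += batchSize) {
    const params = numbers.slice(i, i + batchSize);
    // One "?" placeholder per ID keeps the values out of the SQL string.
    const placeholders = params.map(() => '?').join(', ');
    queries.push({
      sql: `SELECT * FROM applicants WHERE applicationNumber IN (${placeholders})`,
      params,
    });
  }
  return queries;
}

const queries = buildBatchedQueries(['1001', '1002', '1003'], 2);
console.log(queries);
```

Batching keeps each statement under whatever limit the database places on bound parameters while still avoiding 1,500 separate round trips.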

2 answers

Answer 1
1 2019-11-29 15:48:36

Answer 2
1 2019-11-29 15:54:48
