Filter a large JSON dataset based on another dataset

I have a large JSON Dataset A (180,000 records) containing users' complete records, and another JSON Dataset B (a subset of A, about 1,500 records) containing only the unique ID and name of some users. I need to get the complete records from Dataset A for the users in Dataset B.

Here is what I've tried so far:

let detailedSponsoreApplicants = [];
// Outer loop: every record in the master list (Dataset A)
for (let j = 0; j < allApplicants.length; j++) {
    let a = allApplicants[j];

    // Inner loop: every record in the subset (Dataset B)
    for (let i = 0; i < sponsoredApplicants.length; i++) {
        let s = sponsoredApplicants[i];
        if (s && s.number === a.applicationNumber) {
            detailedSponsoreApplicants.push(a);
        } else {
            if (s) {
                logger.warn(`${s.number} not found in master list`);
            }
        }
    }
}

The problem with the above code is that at some point I get the error FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory

So, how do I accomplish this efficiently without running into that error?

EDIT - SAMPLE JSON

Dataset A
{
  "applicationNumber": "3434343",
  "firstName": "dcds",
  "otherNames": "sdcs",
  "surname": "sdcs",
  "phone": "dscd",
  .
  .
  .
  "stateOfOrigin": "dcsd"
}

Dataset B
{
    "number": "3434343",
    "fullName": "dcds sdcs sdcs"
}

Try giving Node more memory to work with:

node --max-old-space-size=1024 index.js #increase to 1gb
node --max-old-space-size=2048 index.js #increase to 2gb
node --max-old-space-size=3072 index.js #increase to 3gb
node --max-old-space-size=4096 index.js #increase to 4gb
node --max-old-space-size=5120 index.js #increase to 5gb
node --max-old-space-size=6144 index.js #increase to 6gb
node --max-old-space-size=7168 index.js #increase to 7gb
node --max-old-space-size=8192 index.js #increase to 8gb
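To confirm that the flag actually took effect, one option is to print the heap limit that V8 reports at startup. This is a minimal sketch using Node's built-in `v8` module; the helper name `heapLimitMb` is made up for illustration.

```javascript
const v8 = require('v8');

// Hypothetical helper: returns the V8 heap size limit in megabytes.
function heapLimitMb() {
  const { heap_size_limit } = v8.getHeapStatistics();
  return Math.round(heap_size_limit / (1024 * 1024));
}

console.log(`Heap limit: ${heapLimitMb()} MB`);
```

The reported figure covers the whole heap, so it may not match the value passed to the flag exactly, but it should move up and down with it.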

Also, your script may take a long time to run. To improve performance, consider using a Map or converting your large array (a, below) into an object keyed by applicationNumber for fast lookups:

const obj = a.reduce((obj, current) => {
  obj[current.applicationNumber] = current;
  return obj;
}, {});

You can then look up the full details in constant time:

const fullDetailsOfFirstObject = obj[B[0].number];
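The same idea with a Map, applied to the whole of Dataset B, might look roughly like this. It is a sketch rather than a drop-in replacement: the function name `pickSponsored`, the `found`/`missing` result shape, and the sample records are assumptions for illustration, while the field names `applicationNumber` and `number` come from the question.

```javascript
// Build a Map keyed by applicationNumber, then look up each record from Dataset B.
// Function and result names are illustrative, not from the original post.
function pickSponsored(allApplicants, sponsoredApplicants) {
  const byNumber = new Map();
  for (const a of allApplicants) {
    byNumber.set(a.applicationNumber, a);
  }

  const found = [];
  const missing = [];
  for (const s of sponsoredApplicants) {
    if (!s) continue; // skip empty entries, as the original loop did
    const full = byNumber.get(s.number);
    if (full) {
      found.push(full);
    } else {
      missing.push(s.number); // report these once, after the loop
    }
  }
  return { found, missing };
}

// Tiny made-up sample to show the shape of the result.
const sample = pickSponsored(
  [
    { applicationNumber: '3434343', firstName: 'dcds' },
    { applicationNumber: '111', firstName: 'x' },
  ],
  [{ number: '3434343' }, { number: '999' }]
);
console.log(sample.found.length, sample.missing);
```

With 180,000 and 1,500 records this does about 181,500 Map operations instead of up to 270 million comparisons. It also records each unmatched ID once, whereas the nested loop in the question logs a warning for almost every non-matching pair, which may well be contributing to the memory pressure. If Dataset A contains duplicate application numbers, the Map keeps only the last record for each.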

Maybe not the most efficient approach, but one that will work is:

1) Import Dataset A (the huge one) into a database, for example SQLite or any database you are familiar with.

2) Add an index on the field applicationNumber.

3) Query the database for each of the elements in Dataset B, or try querying in bulk (selecting more than one at a time).

I've done this before for a similar use case and it worked, but in your case there might be better ways of doing it.
