Filter a large JSON dataset based on another dataset
I have a large JSON Dataset A (180,000 records) containing users' complete records, and another JSON Dataset B (a subset of A, about 1,500 records) containing only some users' unique IDs and names. I need to get the complete records for the users in Dataset B from Dataset A.
Here is what I've tried so far:
let detailedSponsoreApplicants = [];
let j;
for (j = 0; j < allApplicants.length; j++) {
  let a = allApplicants[j];
  let i;
  for (i = 0; i < sponsoredApplicants.length; i++) {
    let s = sponsoredApplicants[i];
    if (s && s.number === a.applicationNumber) {
      detailedSponsoreApplicants.push(a);
    } else {
      if (s) {
        logger.warn(`${s.number} not found in master list`);
      }
    }
  }
}
The problem with the above code is that at some point I get the error:
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
So, how do I efficiently achieve the task without the errors?
EDIT - SAMPLE JSON
Dataset A
{
  "applicationNumber": "3434343",
  "firstName": "dcds",
  "otherNames": "sdcs",
  "surname": "sdcs",
  "phone": "dscd",
  .
  .
  .
  "stateOfOrigin": "dcsd"
}
Dataset B
{
"number": "3434343",
"fullName": "dcds sdcs sdcs"
}
Try giving Node more memory to work with:
node --max-old-space-size=1024 index.js #increase to 1gb
node --max-old-space-size=2048 index.js #increase to 2gb
node --max-old-space-size=3072 index.js #increase to 3gb
node --max-old-space-size=4096 index.js #increase to 4gb
node --max-old-space-size=5120 index.js #increase to 5gb
node --max-old-space-size=6144 index.js #increase to 6gb
node --max-old-space-size=7168 index.js #increase to 7gb
node --max-old-space-size=8192 index.js #increase to 8gb
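To confirm that the flag actually took effect, a small check like the following can help. This is only a sketch: it uses Node's built-in v8 module, and the file name check-heap.js in the usage line is hypothetical. The reported limit may not match the flag value exactly, since it covers more than the old space alone.

```javascript
// Print the V8 heap size limit of the current process in megabytes.
const v8 = require("v8");

const heapLimitBytes = v8.getHeapStatistics().heap_size_limit;
const heapLimitMb = Math.round(heapLimitBytes / 1024 / 1024);
console.log(`V8 heap limit: ${heapLimitMb} MB`);
```

Run it with the same flag you pass to your script, for example node --max-old-space-size=4096 check-heap.js, and compare the printed value with and without the flag.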
Also, your script may take a long time to run. If you want to increase performance, consider using a Map or converting your large array into an object for fast lookups:
// `a` is assumed to be Dataset A (the full array); index it by applicationNumber.
const obj = a.reduce((acc, current) => {
  acc[current.applicationNumber] = current;
  return acc;
}, {});
You can then look up full details in constant time:
const fullDetailsOfFirstObject = obj[B[0].number];
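Putting the lookup idea together with the variable names from the question, a minimal sketch using a Map might look like this. The sample records and the helper name findDetailedApplicants are made up for illustration; only applicationNumber and number come from the question. It makes one pass over Dataset A to build the index and one pass over Dataset B to look records up, and it collects missing numbers instead of logging inside a nested loop.

```javascript
// Made-up sample data shaped like the question's Dataset A and Dataset B.
const allApplicants = [
  { applicationNumber: "1001", firstName: "Ada", surname: "Obi" },
  { applicationNumber: "1002", firstName: "Bola", surname: "Eze" },
  { applicationNumber: "1003", firstName: "Chidi", surname: "Okoro" },
];
const sponsoredApplicants = [
  { number: "1003", fullName: "Chidi Okoro" },
  { number: "1001", fullName: "Ada Obi" },
  { number: "9999", fullName: "Not In Master" },
];

// Hypothetical helper: index the master list once, then look up each subset entry.
function findDetailedApplicants(master, subset) {
  const byNumber = new Map();
  for (const applicant of master) {
    byNumber.set(applicant.applicationNumber, applicant);
  }

  const found = [];
  const missing = [];
  for (const s of subset) {
    if (!s) continue; // skip empty entries, like the `if (s)` guard in the question
    const match = byNumber.get(s.number);
    if (match) {
      found.push(match);
    } else {
      missing.push(s.number);
    }
  }
  return { found, missing };
}

const { found, missing } = findDetailedApplicants(allApplicants, sponsoredApplicants);
console.log(found.map((a) => a.applicationNumber)); // [ '1003', '1001' ]
console.log(missing); // [ '9999' ]
```

With 180,000 and 1,500 records this does roughly 181,500 steps instead of 270 million comparisons, and each missing number is reported at most once.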
Maybe not the most efficient approach, but one that will work is:
1) Import Dataset A (the huge one) into a database, for example SQLite or a database that you are familiar with.
2) Add an index on the field applicationNumber.
3) Query the database for each of the elements in Dataset B, or try querying in bulk (selecting more than one at a time).
I've done this before for a similar use case and it worked, but in your case there might still be better ways of doing it.