
How to perform optimal pattern search between huge JSON files?

I am looking for a way to perform a pattern search between JSON files that can traverse huge JSON files without much impact on performance. Here are a few test cases.

Search criteria


  1. 'cabin_1' matches 'cabin_1'
  2. 'cabin_3' matches 'cabin 3' or '3 cabin'
  3. 'first cabin' matches '1st cabin'
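
One way to make these criteria concrete is to reduce every label to a canonical key and compare the keys. The sketch below is only an illustration under assumptions: `normalizeLabel` and `ORDINALS` are hypothetical names that do not appear in the original post, and the ordinal table covers only the words needed for the three cases above.

```javascript
// Hypothetical lookup table: ordinal words mapped to digits (assumption, not from the post).
const ORDINALS = new Map([
  ['first', '1'], ['1st', '1'],
  ['second', '2'], ['2nd', '2'],
  ['third', '3'], ['3rd', '3'],
]);

// Hypothetical helper: turn a label into a canonical key.
function normalizeLabel(label) {
  return label
    .toLowerCase()
    .split(/[\s_]+/)                   // split on spaces and underscores
    .filter(Boolean)                   // drop empty tokens
    .map(t => ORDINALS.get(t) ?? t)    // replace ordinal words with digits
    .sort()                            // make word order irrelevant
    .join(' ');
}

console.log(normalizeLabel('cabin_3') === normalizeLabel('3 cabin'));       // true
console.log(normalizeLabel('first cabin') === normalizeLabel('1st cabin')); // true
```

Note that with this normalization 'first cabin' would also match 'cabin_1'; whether that is desired depends on the actual data.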

Test case files


You can find the test JSON files here.

My idea


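    // Naive approach: compare every value in json1 with every value in json2.
    // isMatch stands in for the regex-based comparison ('somepattern').
    function hasMatch(json1Properties, json2Properties, isMatch) {
      for (const json1Property of json1Properties) {
        for (const json2Property of json2Properties) {
          if (isMatch(json1Property, json2Property)) {
            return true; // stop at the first matching pair
          }
        }
      }
      return false; // only after every pair has been checked
    }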

This is rather basic and I'm no algorithm expert, but basically the goal is to build a simple index for each array. You simplify the values and map them to something easier and faster to compare later. I think that, one way or another, you have to iterate over the arrays.

Here, you iterate once over each array to build the indices, whereas your first attempt uses a double loop.

A double loop still exists to some extent in the second phase, when the indices are compared with filter / includes, but I think it is lighter because the arrays are shorter and the data is simpler to check.

    const data = {
      "Building": {
        "floor": [
          { "space": ["cabin_1", "cabin_2", "cabin_3", "mycabin"] },
          { "space": ["first cabin", "xyz's cabin", "Zone c", "Zone d"] }
        ]
      }
    };

    const spaces = data.Building.floor;

    const indices = spaces.reduce((acc, item) => {
      acc.push(item.space.map(it => {
        return it.replace(/ ?cabin[_ ]?/g, '') // Remove "cabin", trailing spaces and underscores.
          .replace(/1st|first/g, '1')          // Map things that are not numbers to numbers.
          .replace(/2nd|second/g, '2')
          .replace(/3rd|third/g, '3');
      }).filter(it => !isNaN(it)));            // Remove everything that is not processed by the index engine.
      return acc;
    }, []);

    console.log(indices);

    let shorterArray, longerArray;
    if (indices[0].length > indices[1].length) {
      shorterArray = indices[1];
      longerArray = indices[0];
    } else {
      shorterArray = indices[0];
      longerArray = indices[1];
    }

    const sharedItems = shorterArray.filter(it => longerArray.includes(it));
    console.log('Shared items found', !!sharedItems.length, sharedItems);
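
If the indices themselves grow large, the filter / includes comparison is still a nested scan. A possible variation, sketched here under the assumption that the index entries are plain strings, is to put one index into a `Set` so that each lookup takes roughly constant time. `sharedKeys` is a hypothetical helper name, not part of the code above.

```javascript
// Hypothetical helper: intersect two index arrays using a Set for lookups.
function sharedKeys(indexA, indexB) {
  // Build the Set from the shorter array to keep memory small,
  // then scan the longer array once.
  const [shorter, longer] =
    indexA.length <= indexB.length ? [indexA, indexB] : [indexB, indexA];
  const lookup = new Set(shorter);
  const shared = new Set(); // a Set also removes duplicate hits
  for (const key of longer) {
    if (lookup.has(key)) shared.add(key);
  }
  return [...shared];
}

const shared = sharedKeys(['1', '2', '3'], ['1']);
console.log('Shared items found', shared.length > 0, shared);
```

This keeps the two-phase idea (build indices, then compare) and only swaps the comparison step; the cost becomes roughly proportional to the combined length of the two indices rather than their product.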
