
How do I loop over a VERY LARGE 2D array without causing a major performance hit?

I am attempting to iterate over a very large 2D array in JavaScript within an Ionic application, but it is severely bogging down my app.

A little background: I created a custom search component with StencilJS that provides suggestions on keyup. You feed the component an array of strings (search suggestions). Each string is lowercased and tokenized word by word into an array.

For example, "Red-Winged Blackbird" becomes

['red','winged','blackbird']

So, tokenizing an array of strings looks like this:

[['red','winged','blackbird'],['bald','eagle'], ...]
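The tokenization described above can be sketched like this (a minimal version; the exact splitting rules, such as treating hyphens as word breaks, are assumptions):

```javascript
// Minimal tokenizer sketch: lowercase, split on whitespace and hyphens,
// drop empty fragments. The delimiter set here is an assumption.
const tokenizeSuggestion = (s) =>
  s.toLowerCase().split(/[\s-]+/).filter((t) => t !== '');

const suggestions = ['Red-Winged Blackbird', 'Bald Eagle'];
const tokenized = suggestions.map(tokenizeSuggestion);
// tokenized: [['red','winged','blackbird'], ['bald','eagle']]
```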

Now, I have 10,000+ of these smaller arrays within one large array.

Then, I tokenize the search terms the user inputs upon each keyup.

Afterwards, I am comparing each tokenized search term array to each tokenized suggestion array within the larger array.

Therefore, I have 2 nested for-of loops.

In addition, I am using Levenshtein distance to compare each search term to each element of each suggestion array.
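Concretely, the brute-force approach described above looks roughly like this (a sketch; `matches` and `maxDist` are assumed names, not the actual component code). Every keystroke costs one Levenshtein call per (query token, suggestion token) pair across all 10,000+ suggestions, which is why it bogs down:

```javascript
// Classic dynamic-programming Levenshtein distance.
const levenshtein = (a, b) => {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                   // deletion
        dp[i][j - 1] + 1,                                   // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
      );
    }
  }
  return dp[a.length][b.length];
};

// Brute-force matching: two nested for-of loops with a Levenshtein
// call per (query token, suggestion token) pair.
const matches = (suggestions, queryTokens, maxDist = 1) =>
  suggestions.filter((tokens) => {
    for (const q of queryTokens) {
      let hit = false;
      for (const t of tokens) {
        if (levenshtein(q, t) <= maxDist) { hit = true; break; }
      }
      if (!hit) return false;
    }
    return true;
  });
```

For example, `matches([['red','winged','blackbird'], ['bald','eagle']], ['blackbrd'])` returns only the blackbird entry, since "blackbrd" is one edit away from "blackbird".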

I had a couple of drinks, so please be patient while I stumble through this.

To start, I'd do something like an inverted index (also called a reverse index). It's pretty close to what you are already doing, but with a couple of extra steps.

First, go through all your results and tokenize, stem, remove stop words, lowercase, coalesce, etc. It looks like you've already done this, but I'm adding an example for completeness.

const tokenize = (string) => {
  const tokens = string
    .split(' ') // just split on spaces; a smarter tokenizer could go here
    .filter((v) => v.trim() !== '');

  return new Set(tokens);
};
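To illustrate the stemming/stop-word/lowercasing step mentioned above, here is a toy normalization pass (the stop-word list and splitting rules are illustrative assumptions; a real pipeline would use a full stop-word list and an actual stemmer such as Porter):

```javascript
// Toy normalization: lowercase, split on whitespace/hyphens, drop stop words.
// STOP_WORDS is a tiny illustrative sample, not a real list.
const STOP_WORDS = new Set(['the', 'a', 'an', 'of', 'and']);

const normalize = (string) => {
  const tokens = string
    .toLowerCase()
    .split(/[\s-]+/)
    .filter((t) => t !== '' && !STOP_WORDS.has(t));
  return new Set(tokens);
};
// normalize('The Red-Winged Blackbird') → Set { 'red', 'winged', 'blackbird' }
```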

Next, we want to generate a map that takes a word as a key and returns the list of document indexes the word appears in.

const documents = ['12312 taco', 'taco mmm'];
const index = {
  '12312': [0],
  'taco': [0, 1],
  'mmm': [1]
};

I think you can see where this is taking us... We can tokenize our search term, find all the documents each token belongs to, work some ranking magic, take the top 5, blah blah blah, and have our results. This is typically how Google and the other search giants do their searches. They spend a ton of time on precomputation so that their search engines can slice the candidate set down by orders of magnitude and then work their magic.

Below is an example snippet. This needs a ton of work (please remember, I've been drinking), but it runs through a million records in about 0.3 ms. I'm cheating a bit by generating 2-letter words and phrases, only so that I can demonstrate queries that sometimes produce collisions. This really doesn't matter, since the query time is on average proportional to the number of matching records rather than the total number of records. Please be aware that this solution gives you back records that contain all search terms. It doesn't care about context or anything else. You will have to figure out the ranking (if you care at that point) to achieve the results you want.

const tokenize = (string) => {
  const tokens = string.split(' ')
    .filter((v) => v.trim() !== '');
    
  return new Set(tokens);
};

const ri = (documents) => {
  const index = new Map();
  
  for (let i = 0; i < documents.length; i++) {
    const document = documents[i];
    const tokens = tokenize(document);
    
    for (let token of tokens) {
      if (!index.has(token)) {
        index.set(token, new Set());
      }
      
      index.get(token).add(i);
    }
  }
  
  return index;
};

const intersect = (sets) => {
  const [head, ...rest] = sets;
  
  return rest.reduce((r, set) => {
    return new Set([...r].filter((n) => set.has(n)))
  }, new Set(head));
};

const search = (index, query) => {
  const tokens = tokenize(query);
  const candidates = [];

  for (let token of tokens) {
    const keys = index.get(token);

    if (keys != null) {
      candidates.push(keys);
    }
  }

  return intersect(candidates);
};

const word = () => Math.random().toString(36).substring(2, 4);

const documents = Array.from({ length: 1000000 }, () => {
  const sb = [];
  
  for (let i = 0; i < 2; i++) {
    sb.push(word());
  }
  
  return sb.join(' ');
});

const index = ri(documents);
const st = performance.now();
const query = 'bb iz';
const results = search(index, query);
const et = performance.now();

console.log(query, Array.from(results).slice(0, 10).map((i) => documents[i]));
console.log(et - st);

There are some improvements you can make if you want. Like... ranking! The whole purpose of this example is to show how we can cut 1M records down to maybe a hundred or so candidates. The search function does some post-filtering via intersection, which probably isn't exactly what you want, but at that point it doesn't really matter what you do, since the candidate set is so small.
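One simple way to add that ranking (a sketch, under the assumption that "matched more query tokens" is a reasonable score; `rank` and `topN` are names I'm making up): instead of intersecting the posting lists, count how many of them each document appears in and sort descending.

```javascript
// Rank candidate documents by how many query-token posting lists they
// appear in (union + count), instead of requiring membership in all of
// them (intersection). Works on arrays or Sets of doc ids.
const rank = (postingLists, topN = 5) => {
  const counts = new Map();
  for (const list of postingLists) {
    for (const docId of list) {
      counts.set(docId, (counts.get(docId) || 0) + 1);
    }
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])   // most matched tokens first
    .slice(0, topN)
    .map(([docId]) => docId);
};
```

For example, `rank([[0, 1], [1, 2]])` puts document 1 first, because it appears in both posting lists, while documents 0 and 2 matched only one token each.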
