
Is there a more efficient way to assess containment of strings?

I have to execute this line of code several million times; I wonder if there is a way to optimize it (maybe by precomputing something?).

a.contains(b) || b.contains(a)

Thank you

edit: the code executed by the contains method already checks for a.length < b.length.

public static int indexOf(byte[] value, int valueCount, byte[] str, int strCount, int fromIndex) {
    byte first = str[0];
    int max = (valueCount - strCount);
    for (int i = fromIndex; i <= max; i++) {
        [...]
    }
    return -1;
}

As I understand the task, you have to check whether a contains b, or vice versa, for each pair of words a and b from a set of about 35 million words. That's a lot of pairs to check.

You should be able to narrow the search down considerably by precomputing which n-grams a word contains: if a contains some n-gram, then b has to contain the same n-gram if b contains a. You could, eg, precompute all the trigrams that each word in the list contains, and at the same time all the words that contain a given trigram; then you can just look up the words in those dictionaries and, with some set operations, get a small set of candidates to check properly.

In pseudo-code:

  • select a size for the n-grams (see below)
  • initialize a Map<String, Set<String>> ngrams_to_words
  • first iteration: for each word a in your data set
    • iterate over all the n-grams of a (eg using some sort of sliding window)
    • for each of them, add a to the sets of words containing those n-grams in ngrams_to_words
  • second iteration: for each word a in your data set
    • again get all the n-grams a contains
    • for each of those, get the set of words that contain that n-gram from ngrams_to_words
    • get the intersection of those sets of words
    • for each word b in that intersection that contains all the n-grams a contains (but maybe in a different order or quantity), properly check whether b contains a
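The two passes above can be sketched in Java roughly as follows; the class name, the trigram size N = 3, the helper names, and the toy word list are all illustrative choices, not part of the original answer:

```java
import java.util.*;

public class ContainmentFilter {
    static final int N = 3; // n-gram size; 3 (trigrams) is an arbitrary choice here

    // All n-grams of a word, collected with a sliding window.
    static Set<String> ngramsOf(String w) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + N <= w.length(); i++) {
            grams.add(w.substring(i, i + N));
        }
        return grams;
    }

    // First iteration: map each n-gram to the set of words containing it.
    static Map<String, Set<String>> buildIndex(List<String> words) {
        Map<String, Set<String>> ngramsToWords = new HashMap<>();
        for (String w : words) {
            for (String g : ngramsOf(w)) {
                ngramsToWords.computeIfAbsent(g, k -> new HashSet<>()).add(w);
            }
        }
        return ngramsToWords;
    }

    // Second iteration, per word: intersect the posting sets of a's n-grams
    // to get the candidates b that might contain a.
    static Set<String> candidates(String a, Map<String, Set<String>> index) {
        Set<String> result = null;
        for (String g : ngramsOf(a)) {
            Set<String> posting = index.getOrDefault(g, Collections.emptySet());
            if (result == null) {
                result = new HashSet<>(posting);
            } else {
                result.retainAll(posting); // set intersection
            }
            if (result.isEmpty()) break;  // no candidate can survive
        }
        return result == null ? Collections.emptySet() : result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("foobar", "oba", "barfoo", "quux");
        Map<String, Set<String>> index = buildIndex(words);
        Set<String> matches = new TreeSet<>();
        for (String b : candidates("oba", index)) {
            if (b.contains("oba")) matches.add(b); // the proper, expensive check
        }
        System.out.println(matches); // prints [foobar, oba]
    }
}
```

Note that the expensive contains call only runs on the small candidate set, not on all 35 million words.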

Depending on the number of letters in those n-grams (eg bigrams, trigrams, ...), they will be more expensive to precompute, in both time and space, but the effect will also be greater. In the simplest case, you could even just precompute which words contain a given letter (ie "1-grams"); that should be fast and already narrow down the words to check considerably. Of course, the n-grams should not be longer than the shortest of the words in the data set, but you could even use two lengths of n-grams, eg use two maps letter_to_words and trigrams_to_words.
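The "1-gram" variant can even avoid sets entirely: since there are only 26 lowercase letters, the letters a word contains fit in a single int bitmask, and the subset test becomes one bitwise AND. A minimal sketch, assuming lowercase a-z input (the class and method names are mine):

```java
public class LetterMaskFilter {
    // 26-bit mask of which letters a word contains (assumes lowercase a-z;
    // other characters are simply ignored in this sketch).
    static int letterMask(String w) {
        int mask = 0;
        for (int i = 0; i < w.length(); i++) {
            char c = w.charAt(i);
            if (c >= 'a' && c <= 'z') {
                mask |= 1 << (c - 'a');
            }
        }
        return mask;
    }

    // b can only contain a if b has at least every letter that a has,
    // ie a's letter set is a subset of b's.
    static boolean maybeContains(int maskB, int maskA) {
        return (maskB & maskA) == maskA;
    }

    public static void main(String[] args) {
        int a = letterMask("oba");
        System.out.println(maybeContains(letterMask("foobar"), a)); // true: worth calling contains
        System.out.println(maybeContains(letterMask("quux"), a));   // false: skip the contains call
    }
}
```

Precomputing one mask per word costs a single pass over the data, after which each pair is prefiltered in a couple of CPU instructions before the expensive contains call.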

