Is there a more efficient way to assess containment of strings?
I have to execute this line of code several million times; I wonder if there is a way to optimize it (maybe by precomputing something?).
a.contains(b) || b.contains(a)
Thank you
Edit: the code executed by the contains method already checks for a.length < b.length.
public static int indexOf(byte[] value, int valueCount, byte[] str, int strCount, int fromIndex) {
    byte first = str[0];
    int max = (valueCount - strCount);
    for (int i = fromIndex; i <= max; i++) {
        [...]
    }
    return -1;
}
As I understand the task, you have to check whether a contains b, or vice versa, for each pair of a and b from a set of about 35 million words. That's a lot of pairs to check.
You should be able to narrow the search down considerably by precomputing which n-grams each word contains: if b contains a, then b has to contain every n-gram that a contains. You could, e.g., precompute all the trigrams that each word in the list contains, and at the same time all the words that contain a given trigram. Then you can just look up the words in those dictionaries and, with some set operations, get a small set of candidates to check properly.
In pseudo-code:

    Map<String, Set<String>> ngrams_to_words
    for each word a in your data set
        compute the n-grams of a
        add a to the sets of words containing those n-grams in ngrams_to_words
    for each word a in your data set
        for each n-gram of a
            get the set of words containing that n-gram from ngrams_to_words
        intersect those sets of words
        for each word b in that intersection that contains all the n-grams that a
        contains (but maybe in a different order or quantity), properly check
        whether b contains a
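The pseudo-code above can be sketched in Java roughly as follows. The class and method names (`NgramIndex`, `buildIndex`, `wordsContaining`) and the choice of trigrams (N = 3) are mine, not from the original answer; this is a minimal illustration, not a tuned implementation.

```java
import java.util.*;

public class NgramIndex {
    static final int N = 3; // trigram length, an arbitrary choice for this sketch

    // Map each trigram to the set of words that contain it.
    static Map<String, Set<String>> buildIndex(List<String> words) {
        Map<String, Set<String>> ngramsToWords = new HashMap<>();
        for (String w : words) {
            for (int i = 0; i + N <= w.length(); i++) {
                ngramsToWords.computeIfAbsent(w.substring(i, i + N),
                        k -> new HashSet<>()).add(w);
            }
        }
        return ngramsToWords;
    }

    // Words that contain a: intersect the candidate sets for each trigram
    // of a, then run the proper contains() check on the few survivors.
    static Set<String> wordsContaining(String a, Map<String, Set<String>> index) {
        Set<String> candidates = null;
        for (int i = 0; i + N <= a.length(); i++) {
            Set<String> s = index.getOrDefault(a.substring(i, i + N),
                    Collections.emptySet());
            if (candidates == null) candidates = new HashSet<>(s);
            else candidates.retainAll(s);
            if (candidates.isEmpty()) break;
        }
        Set<String> result = new HashSet<>();
        if (candidates != null) {
            for (String b : candidates) {
                if (b.contains(a)) result.add(b);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index =
                buildIndex(Arrays.asList("container", "contain", "tain", "train"));
        System.out.println(wordsContaining("tain", index));
    }
}
```

The expensive `contains` call now only runs on words that survive the intersection, which for a large dictionary should be a tiny fraction of all pairs.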
Depending on the number of letters in those n-grams (e.g. bigrams, trigrams, ...), they will be more expensive to precompute, in both time and space, but the effect will also be greater. In the simplest case, you could even just precompute which words contain a given letter (i.e. "1-grams"); that should be fast and already narrow down the words to check considerably. Of course, the n-grams should not be longer than the shortest of the words in the data set, but you could even use two lengths of n-grams, e.g. use two maps letter_to_words and trigrams_to_words.
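The 1-gram variant mentioned above might look like this. Again, the names (`LetterIndex`, `letterToWords`, `candidates`) are illustrative only; the filter is coarser than the trigram index but cheaper to build.

```java
import java.util.*;

public class LetterIndex {
    // Map each letter to the set of words that contain it ("1-grams").
    static Map<Character, Set<String>> buildIndex(List<String> words) {
        Map<Character, Set<String>> letterToWords = new HashMap<>();
        for (String w : words) {
            for (char c : w.toCharArray()) {
                letterToWords.computeIfAbsent(c, k -> new HashSet<>()).add(w);
            }
        }
        return letterToWords;
    }

    // Candidate superstrings of a: words containing every letter of a.
    // These still need the proper contains() check afterwards.
    static Set<String> candidates(String a, Map<Character, Set<String>> index) {
        Set<String> result = null;
        for (char c : a.toCharArray()) {
            Set<String> s = index.getOrDefault(c, Collections.emptySet());
            if (result == null) result = new HashSet<>(s);
            else result.retainAll(s);
            if (result.isEmpty()) break;
        }
        return result == null ? Collections.emptySet() : result;
    }

    public static void main(String[] args) {
        Map<Character, Set<String>> index =
                buildIndex(Arrays.asList("cat", "act", "dog", "cart"));
        // "cat", "act", and "cart" all contain the letters c, a, t; "dog" is
        // filtered out without ever calling contains().
        System.out.println(candidates("cat", index));
    }
}
```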