Regex to match a dictionary word only if all of its characters are present in the phrase, and no character occurs more often than it does in the phrase
I'm writing a recursive backtracking search to find anagrams for a phrase. For the first step, I'm trying to filter out all the wrong words from a dictionary before I feed it to the recursive algorithm.
The dictionary file looks like this:
aback
abacus
abalone
abandon
abase
...
[40,000 more words]
The regex I want to construct must filter out words that contain characters the phrase does not contain, as well as words that contain more occurrences of a character than exist in the phrase. For example, given the phrase "clint eastwood", the word "noodle" matches, but the word "stonewall" does not, since "stonewall" contains more "l" characters than "clint eastwood" does.
Simply using "[clint eastwood]+" as the regex almost does what I want, but it matches words containing any number of the characters in the phrase.
A regex is the wrong tool for comparing character counts. Any regex that satisfies this requirement is likely to be awkward and terribly inefficient. You will be far better off traversing each word and keeping track of the individual character counts.
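The count-based check suggested above can be sketched as follows. This is a minimal illustration, not the answerer's code; the method names are my own, and it assumes only ASCII letters matter (which holds for the dictionary shown in the question):

```java
import java.util.Locale;

public class AnagramFilter {
    // A word is kept only if, for every letter, the word uses no more
    // occurrences of that letter than the phrase contains.
    static boolean canBeFormedFrom(String word, String phrase) {
        int[] available = letterCounts(phrase);
        int[] needed = letterCounts(word);
        for (int i = 0; i < 26; i++) {
            if (needed[i] > available[i]) {
                return false;
            }
        }
        return true;
    }

    // Tally occurrences of each lowercase letter; non-letters are ignored.
    static int[] letterCounts(String s) {
        int[] counts = new int[26];
        for (char c : s.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                counts[c - 'a']++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(canBeFormedFrom("noodle", "clint eastwood"));    // true
        System.out.println(canBeFormedFrom("stonewall", "clint eastwood")); // false
    }
}
```

Running this once per dictionary word is linear in the total input size, which is why a plain traversal beats any count-comparing regex.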
Anyway, here is a method for constructing a regex that matches the "wrong words" (the other way around is much harder). First, from the set of distinct characters {a1, ..., aN} contained in the phrase, you can match all words containing any illegal character with [^a1...aN]. Then, for each character c that appears n times in your target string, build a sub-expression (.*c.*){n+1}, and join these fragments with |. For "clint eastwood" you should get:
(.*c.*){2}|(.*l.*){2}|(.*i.*){2}|(.*n.*){2}|(.*t.*){3}|(.*e.*){2}|(.*a.*){2}|(.*s.*){2}|(.*w.*){2}|(.*o.*){3}|(.*d.*){2}|[^clinteaswod]
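The construction above can be automated for an arbitrary phrase. This is a sketch of my own (the method name `wrongWordPattern` is not from the answer): count each letter, emit one (.*c.*){n+1} fragment per letter, and append the negated character class; a word is rejected if the pattern is found anywhere in it.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;
import java.util.regex.Pattern;

public class WrongWordRegex {
    // Builds the "wrong words" pattern described above; any word that
    // matches it should be filtered out of the dictionary.
    static Pattern wrongWordPattern(String phrase) {
        // Letter counts in first-seen order, so the fragments come out
        // in the same order as the hand-written example.
        Map<Character, Integer> counts = new LinkedHashMap<>();
        StringBuilder allowed = new StringBuilder();
        for (char c : phrase.toLowerCase(Locale.ROOT).toCharArray()) {
            if (Character.isLetter(c) && counts.merge(c, 1, Integer::sum) == 1) {
                allowed.append(c);
            }
        }
        StringJoiner alternatives = new StringJoiner("|");
        for (Map.Entry<Character, Integer> e : counts.entrySet()) {
            // (.*c.*){n+1} matches any word with too many copies of c
            alternatives.add("(.*" + e.getKey() + ".*){" + (e.getValue() + 1) + "}");
        }
        // [^...] matches any word containing a letter absent from the phrase
        alternatives.add("[^" + allowed + "]");
        return Pattern.compile(alternatives.toString());
    }

    public static void main(String[] args) {
        Pattern wrong = wrongWordPattern("clint eastwood");
        System.out.println(wrong.matcher("stonewall").find()); // true: rejected
        System.out.println(wrong.matcher("noodle").find());    // false: kept
    }
}
```

Note the nested quantified groups: on a long word the (.*c.*){n+1} fragments can backtrack heavily, which is exactly the inefficiency warned about above.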
As stated in the previous answer, a regex is not what you should be looking at. You need to record character counts for each word so that invalid entries can be filtered quickly later on. I have a solution that uses a Map<String, Map<Character, Integer>> to do so.
Map<String, Map<Character, Integer>> wordCharacterCount = new HashMap<>();
try (Scanner scanner = new Scanner(new File(...))) {
    while (scanner.hasNextLine()) {
        String word = scanner.nextLine();
        // Count the occurrences of each letter in this word
        Map<Character, Integer> characterCount = new HashMap<>();
        for (char raw : word.toCharArray()) {
            char c = Character.toLowerCase(raw);
            if (Character.isLetter(c)) {
                characterCount.merge(c, 1, Integer::sum);
            }
        }
        wordCharacterCount.put(word, characterCount);
    }
}
I used the Stream API for simplicity. For every phrase you would like to filter the dictionary entries against, you construct a similar Map<Character, Integer> and then filter the entries depending on whether each word (1) contains invalid characters or (2) has a character count greater than that of the provided phrase.
String testWord = "clint eastwood";
// The phrase's own character counts, built the same way as above
Map<Character, Integer> testWordCharacterCount = new HashMap<>();
for (char raw : testWord.toCharArray()) {
    char c = Character.toLowerCase(raw);
    if (Character.isLetter(c)) {
        testWordCharacterCount.merge(c, 1, Integer::sum);
    }
}
List<String> validWords = wordCharacterCount.keySet().stream()
        .filter(word -> {
            Map<Character, Integer> currentWordCharacterCount = wordCharacterCount.get(word);
            for (Entry<Character, Integer> entry : currentWordCharacterCount.entrySet()) {
                char c = entry.getKey();
                int count = entry.getValue();
                // Reject the word if it uses a letter the phrase lacks,
                // or more copies of a letter than the phrase contains
                if (!testWordCharacterCount.containsKey(c) || testWordCharacterCount.get(c) < count) {
                    return false;
                }
            }
            return true;
        })
        .collect(Collectors.toList());
I didn't benchmark this thoroughly, but in my usage, with a dictionary of 460,000 entries, preprocessing took ~600 ms and each filter pass took ~50-150 ms.