
Regex match if all characters in a dictionary word are present in the phrase. The number of times each character occurs must also match

I'm writing a recursive backtracking search to find anagrams for a phrase. As a first step, I'm trying to filter out all the unusable words from a dictionary before I feed it to the recursive algorithm.

The dictionary file looks like this:

aback
abacus
abalone
abandon
abase
... 
[40,000 more words]

The regex I want to construct must filter out words that contain characters the phrase does not contain, and also words that contain more occurrences of a character than the phrase does.

For example, given the phrase "clint eastwood", the word "noodle" matches, but the word "stonewall" does not, since "stonewall" contains more "l" characters than "clint eastwood" does.

Simply using "[clint eastwood]+" as the regex almost does what I want, but it also accepts words that use a character more often than the phrase does.
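To make the problem concrete, here is a minimal sketch of that character-class attempt. The class name `CharClassDemo` and the method `accepts` are made up for this illustration.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // The naive pattern: any sequence of characters that occur in the phrase.
    private static final Pattern NAIVE = Pattern.compile("[clint eastwood]+");

    // True if the whole word consists only of characters from the phrase.
    static boolean accepts(String word) {
        return NAIVE.matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts("noodle"));    // true, as wanted
        System.out.println(accepts("stonewall")); // true, but it should be rejected (two 'l')
        System.out.println(accepts("abacus"));    // false ('b' and 'u' are not in the phrase)
    }
}
```

A character class only tests membership, so it cannot limit how often each character is used.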

A regex is the wrong tool for comparing character counts. Any regex that satisfies this requirement is likely to be awkward and terribly inefficient. You will be far better off traversing each word and keeping track of the individual character counts.
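A minimal sketch of that counting approach, assuming the phrase and the dictionary only use the letters a-z (anything else in a word causes it to be rejected, and non-letters in the phrase are ignored). The names `LetterCountFilter`, `letterCounts` and `fitsIn` are invented for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

class LetterCountFilter {
    // Counts how often each of 'a'..'z' occurs; all other characters are ignored.
    static int[] letterCounts(String s) {
        int[] counts = new int[26];
        for (char ch : s.toLowerCase(Locale.ROOT).toCharArray()) {
            if (ch >= 'a' && ch <= 'z') {
                counts[ch - 'a']++;
            }
        }
        return counts;
    }

    // True if the word uses only letters of the phrase, each no more often than the phrase has it.
    static boolean fitsIn(String word, int[] phraseCounts) {
        int[] remaining = phraseCounts.clone();
        for (char ch : word.toLowerCase(Locale.ROOT).toCharArray()) {
            if (ch < 'a' || ch > 'z' || --remaining[ch - 'a'] < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] phrase = letterCounts("clint eastwood");
        List<String> words = Arrays.asList("noodle", "stonewall", "abacus");
        List<String> usable = words.stream()
                .filter(w -> fitsIn(w, phrase))
                .collect(Collectors.toList());
        System.out.println(usable); // [noodle]
    }
}
```

The phrase is counted once, and each word is then checked in a single pass that stops at the first offending character.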

Anyway, here is a method for constructing a regex that matches the "wrong words" (the other way around is much harder). First, from the set of distinct characters {a1,...,aN} contained in the phrase, you can match all words containing any illegal character with the negated character class [^a1...aN] (written without separators, since a comma inside the brackets would itself count as an allowed character). Then, for each character c that appears n times in your target string, build a sub-expression (.*c.*){n+1}, which matches any word containing c at least n+1 times. Finally, join all of these fragments with | . A word is wrong if the resulting pattern is found anywhere in it. For clint eastwood you should get:

(.*c.*){2}|(.*l.*){2}|(.*i.*){2}|(.*n.*){2}|(.*t.*){3}|(.*e.*){2}|(.*a.*){2}|(.*s.*){2}|(.*w.*){2}|(.*o.*){3}|(.*d.*){2}|[^clinteaswod]
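The construction above can be sketched in Java as follows. This is only an illustration under stated assumptions: the phrase contains at least one letter, all letters are plain a-z (so nothing needs escaping), and the names `WrongWordRegex`, `build` and `isWrong` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class WrongWordRegex {
    // Builds the "wrong word" pattern for a phrase, keeping letters in order of first appearance.
    static String build(String phrase) {
        Map<Character, Integer> counts = new LinkedHashMap<>();
        for (char ch : phrase.toCharArray()) {
            if (Character.isLetter(ch)) {
                counts.merge(ch, 1, Integer::sum);
            }
        }
        StringBuilder regex = new StringBuilder();
        StringBuilder allowed = new StringBuilder();
        for (Map.Entry<Character, Integer> e : counts.entrySet()) {
            // One more occurrence than the phrase has makes the word wrong.
            regex.append("(.*").append(e.getKey()).append(".*){")
                 .append(e.getValue() + 1).append("}|");
            allowed.append(e.getKey());
        }
        // Any character outside the phrase's letters also makes the word wrong.
        return regex.append("[^").append(allowed).append("]").toString();
    }

    // Uses find() rather than matches(): a hit anywhere in the word is enough.
    static boolean isWrong(Pattern wrongWords, String word) {
        return wrongWords.matcher(word).find();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile(build("clint eastwood"));
        System.out.println(isWrong(p, "noodle"));    // false
        System.out.println(isWrong(p, "stonewall")); // true (two 'l')
        System.out.println(isWrong(p, "abacus"));    // true ('b' and 'u' are not in the phrase)
    }
}
```

For "clint eastwood", `build` returns the same expression as shown above. The nested quantifiers backtrack heavily, which is one reason the counting approach is usually preferable.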

As stated in the previous answer, a regex is not what you should be looking at. You need to record the character counts for each word so that you can quickly filter out invalid words later on. I have a solution that uses a Map<String, Map<Character, Integer>> to do so.

import java.io.File;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

// Preprocessing: map every dictionary word to its per-letter counts.
// new Scanner(File) can throw FileNotFoundException, which the enclosing method must declare or handle.
Map<String, Map<Character, Integer>> wordCharacterCount = new HashMap<>();
try (Scanner scanner = new Scanner(new File(...))) {
    while (scanner.hasNextLine()) {
        String word = scanner.nextLine();
        Map<Character, Integer> characterCount = new HashMap<>();
        for (char raw : word.toCharArray()) {
            char c = Character.toLowerCase(raw);
            if (Character.isLetter(c)) {
                // Add 1 to the count for c, starting at 1 if c has not been seen yet.
                characterCount.merge(c, 1, Integer::sum);
            }
        }
        wordCharacterCount.put(word, characterCount);
    }
}

I used the Stream API for simplicity. For every phrase you would like to filter the dictionary entries on, you construct a similar Map<Character, Integer> and iterate through the dictionary Map, dropping each entry that (1) contains characters not in the phrase or (2) has a higher count for some character than the phrase does.

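// Letter counts for the phrase to filter against.
String testWord = "clint eastwood";
Map<Character, Integer> testWordCharacterCount = new HashMap<>();
char[] characters = testWord.toCharArray();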
for (int i = 0; i < characters.length; i++) {
    char c = Character.toLowerCase(characters[i]);
    if (Character.isLetter(c)) {
        if (!testWordCharacterCount.containsKey(c)) {
            testWordCharacterCount.put(c, 1);
        } else {
            testWordCharacterCount.put(c, testWordCharacterCount.get(c) + 1);
        }
    }
}

List<String> validWords = wordCharacterCount.keySet().stream()
        .filter(word -> {
            Map<Character, Integer> currentWordCharacterCount = wordCharacterCount.get(word);
            for (Entry<Character, Integer> entry : currentWordCharacterCount.entrySet()) {
                char c = entry.getKey();
                int count = entry.getValue();
                if (!testWordCharacterCount.containsKey(c) || testWordCharacterCount.get(c) < count) {
                    return false;
                }
            }
            return true;
        }).collect(Collectors.toList());

I didn't benchmark this thoroughly, but in my usage, with a dictionary of 460,000 entries, preprocessing took ~600 ms and each filter took ~50-150 ms.

