Regular expression to find words containing a subset of specific letters and frequencies of these letters

Given a set of letters, such as "CDEIORSVY", I would like a regular expression that filters the words of an English dictionary (which is provided, all in uppercase) down to the k-letter words that contain only these letters. In this example, a 9-letter solution is "DISCOVERY".

This problem is akin to finding suitable words for Scrabble, or finding solutions to the Letters round or the Conundrum in the game show Countdown.

Given letters that include repeats, such as "DENOOPRSU", a 9-letter solution is "PONDEROUS" but not "SPONSORED" (which needs two S's). A 7-letter solution is "ONEROUS" but not "USURPER" (which needs two U's and two R's).

My question is: what regular expression takes into account the allowed letters, the frequency of each letter, and the word length k?

My regular expressions so far are "^[DENOOPRSU]{9,9}$" and "^[DENOOPRSU]{7,7}$" for the example above. However, these ignore the constraints on letter frequencies, so they also accept the incorrect words from the examples above. My workaround is to filter the results of the regular expression using Counter from Python's collections module, but this is very slow. Therefore, I would like a regular expression that incorporates both the letter and the frequency constraints.
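The question does not show the workaround's code. A minimal sketch of what it might look like is given here for reference, assuming the dictionary is already loaded as a list of uppercase words; the function name filter_words is hypothetical.

```python
import re
from collections import Counter


def filter_words(words, letters, k):
    """Return the k-letter words that can be built from `letters`."""
    # Step 1: regex prefilter on allowed letters and length (ignores frequency).
    prefilter = re.compile("^[%s]{%d}$" % (re.escape(letters), k))
    available = Counter(letters)
    result = []
    for word in words:
        if not prefilter.match(word):
            continue
        # Step 2: Counter subtraction keeps only positive counts, so an
        # empty result means no letter is used more often than available.
        if not (Counter(word) - available):
            result.append(word)
    return result


candidates = ["PONDEROUS", "SPONSORED", "ONEROUS", "USURPER"]
print(filter_words(candidates, "DENOOPRSU", 9))  # ['PONDEROUS']
print(filter_words(candidates, "DENOOPRSU", 7))  # ['ONEROUS']
```

Building a Counter for every word that survives the prefilter is the part that tends to dominate the running time on a large dictionary.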

I am not sure whether this is possible with a single static regex, but a dynamically built one works. See the sample regex below:

^(?!.*([DENPRSU]).*\1)(?!.*([O]).*\2.*\2)[DENOOPRSU]+$

Each letter in the first group may appear at most once in the solution, and each letter in the second group at most twice. More groups can be appended with the same pattern; for example, constraining a letter to appear at most three times would just be (?!.*([chosen letters]).*\3.*\3.*\3).
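One way the dynamic construction could be sketched in Python is shown below. This is an illustration under assumptions, not the answerer's own code: the helper name build_pattern and its optional k parameter are hypothetical, and the word length is enforced by swapping the trailing + for {k}.

```python
import re
from collections import Counter


def build_pattern(letters, k=None):
    """Compile a regex for words drawn from `letters`, respecting counts."""
    counts = Counter(letters)
    # Group the letters by how many times each one is available.
    by_count = {}
    for ch, n in counts.items():
        by_count.setdefault(n, []).append(ch)
    parts = ["^"]
    for group_no, (n, chars) in enumerate(sorted(by_count.items()), start=1):
        cls = "[" + "".join(re.escape(c) for c in sorted(chars)) + "]"
        # Negative lookahead: reject n + 1 occurrences of the same letter
        # (one captured occurrence followed by n backreferences).
        parts.append("(?!.*(%s)%s)" % (cls, (r".*\%d" % group_no) * n))
    allowed = "[" + "".join(re.escape(c) for c in sorted(counts)) + "]"
    parts.append(allowed + ("+" if k is None else "{%d}" % k) + "$")
    return re.compile("".join(parts))


pattern = build_pattern("DENOOPRSU", 9)
print(pattern.pattern)
candidates = ["PONDEROUS", "SPONSORED", "ONEROUS", "USURPER"]
print([w for w in candidates if pattern.match(w)])  # ['PONDEROUS']
```

For "DENOOPRSU" this produces the same two lookaheads as the sample regex above, followed by the allowed-letter class with a fixed length.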

Perhaps you could use an approach without a regex, counting the occurrences of each letter in a dictionary.

First check whether anything is left after removing the source letters from the set of letters in the candidate word; if so, the candidate contains a character that must not be used. Then compare the per-letter counts of the candidate against those of the source.

def charsWithCounter(s):
    res = {}
    for c in s:
        res[c] = res.get(c, 0) + 1
    return res

def isValid(str_source, str_to_compare):
    if list(set(str_to_compare) - set(str_source)):
        return False

    dct_source = charsWithCounter(str_source)
    dct_to_compare = charsWithCounter(str_to_compare)

    for key in dct_to_compare.keys():
        if key in dct_source and dct_to_compare[key] > dct_source[key]:
            return False
    return True


dct = {
    "DENOOPRSU": ["PONDEROUS", "PONDEROUSZ", "PONDEROU S", "SPONSORED", "ONEROUS", "USURPER", "ABC", " ", "OOO", "O"],
    "CDEIORSVY": ["DISCOVERY"]
}

for k, v in dct.items():
    for s in v:
        print("'{}' --> '{}' : {}".format(k, s, str(isValid(k, s))))

Output

'DENOOPRSU' --> 'PONDEROUS' : True
'DENOOPRSU' --> 'PONDEROUSZ' : False
'DENOOPRSU' --> 'PONDEROU S' : False
'DENOOPRSU' --> 'SPONSORED' : False
'DENOOPRSU' --> 'ONEROUS' : True
'DENOOPRSU' --> 'USURPER' : False
'DENOOPRSU' --> 'ABC' : False
'DENOOPRSU' --> ' ' : False
'DENOOPRSU' --> 'OOO' : False
'DENOOPRSU' --> 'O' : True
'CDEIORSVY' --> 'DISCOVERY' : True

See a Python demo.
