
Find the longest common substring suffix between a string and a key in a dictionary

I'm trying to write a program that finds the longest common suffix between a string and a key in a dictionary.

A simple example: the dictionary has about 6000 key:value pairs, so I won't include all of it. For reference, the keys are 2 to 7 characters long.

codeCountry = {
    'AFHAS': 'AFGHANISTAN',
    'AXUYFF': 'ÅLAND ISLANDS',
    'ALUU': 'ALBANIA',
    'DZBG': 'ALGERIA',
    'ASSQ': 'AMERICAN SAMOA',
    'ADDD': 'ANDORRA',
    'ANGO': 'ANGOLA',
    'ANGI': 'ANGUILLA',
    'AQ': 'ANTARCTICA',
    'AG': 'ANTIGUA AND BARBUDA',
    'AMENI': 'ARMENIA',
    'AURI': 'ARUBA',
    'AUR': 'ARGENTINA',
    'AURII': 'AUSTRALIA',
    # ... (remaining entries omitted)
}

As the string I will use "AMAURI" to make the example clearer (the string is generated randomly and has a variable length of 1 to 16 characters, but it always contains one of the suffixes (keys) from the dictionary):

strToUse = "AMAURI"

Expected result: "ARUBA", because the longest common suffix between the string and the keys in the dictionary is "AURI", so "AURI": "ARUBA".

How can I do this in Python? I tried something like this (I'm new to Python):

for country in codeCountry:
    if country in strToUse:
        print(codeCountry.get(country))

But this prints "ARGENTINA", which isn't correct, and I don't understand exactly why. There are similar questions here on Stack Overflow, but my problem is different in that it looks for a suffix and not just any characters inside the string. I hope I was clear; I'm quite confused myself and don't know how to do it. Can anybody help me, or at least point me in the right direction?

You can sort the keys by length first, longest to shortest, and then check them in that order:

strToUse = "AMAURI"
for country in sorted(codeCountry.keys(),key=len,reverse=True):
    if country in strToUse:
        print(codeCountry.get(country))
        break
ARUBA
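
Note that `in` matches a key anywhere inside the string, not only at its end. If a strict suffix match is what is wanted, a minimal sketch using `str.endswith` could look like the following. The helper name `longest_suffix_key` and the shortened sample dictionary are assumptions made for illustration; they are not part of the original question or answer.

```python
# Shortened sample of the dictionary from the question (illustrative only).
codeCountry = {
    'AQ': 'ANTARCTICA',
    'AG': 'ANTIGUA AND BARBUDA',
    'AMENI': 'ARMENIA',
    'AURI': 'ARUBA',
    'AUR': 'ARGENTINA',
    'AURII': 'AUSTRALIA',
}


def longest_suffix_key(text, table):
    """Return the longest key of `table` that `text` ends with, or None."""
    # Keep only the keys that are real suffixes of the text.
    matches = [key for key in table if text.endswith(key)]
    # Pick the longest one; `default=None` covers the no-match case.
    return max(matches, key=len, default=None)


key = longest_suffix_key("AMAURI", codeCountry)
print(codeCountry.get(key))
```

With this version a string such as "AURIXX" yields no match, whereas the `in` test would still report "AURI" because it occurs inside the string.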

Try out the code below and see if it works for you. stringSubsets() returns a set of all possible keys (country codes) that could be constructed from your input string ("AMAURI" in your example). Set intersection is then used on the codeCountry dict to get all the keys that match a substring in the set returned by stringSubsets(). The print statement in the last line shows how to extract the value of the longest matching key, or return None if no key matches, to avoid a KeyError.

If for some reason your input strings (in this case "AMAURI") are very long and you need to speed up your code, you might be able to use something more advanced, such as the Aho-Corasick algorithm. If you go this route, you might be able to invert your approach and search the input string for the longest key in your dict (rather than searching the dict for a substring). This could work well because your codeCountry dict probably won't change often, so the trie that Aho-Corasick relies on could be built ahead of time from your dict keys, making the search on the input string very fast.

codeCountry = {
'AFHAS': 'AFGHANISTAN',
'AXUYFF': 'ÅLAND ISLANDS',
'ALUU': 'ALBANIA',
'DZBG': 'ALGERIA',
'ASSQ': 'AMERICAN SAMOA',
'ADDD': 'ANDORRA',
'ANGO': 'ANGOLA',
'ANGI': 'ANGUILLA',
'AQ': 'ANTARCTICA',
'AG': 'ANTIGUA AND BARBUDA',
'AMENI': 'ARMENIA',
'AURI': 'ARUBA',
'AUR': 'ARGENTINA',
'AURII': 'AUSTRALIA'
}

def stringSubsets(s):
    out = set()
    for i in range(len(s)):
        for j in range(i+1, len(s)+1):
            out.add(s[i:j])

    return out

code = "AMAURI"
candidates = stringSubsets(code)
keys = candidates.intersection(codeCountry)

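# Results in None if no substring matches a key in the dict; otherwise gives
# the value of the longest matching key (key=len compares by length rather
# than alphabetically, which is what a bare max() would do).
print(None if not keys else codeCountry[max(keys, key=len)])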
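
For the Aho-Corasick idea mentioned above, here is a rough pure-Python sketch of how the trie could be built once from the dict keys and then reused to scan input strings. The function names `build_automaton` and `longest_key_in`, the list-based node layout, and the shortened sample dictionary are all assumptions made for illustration; a dedicated library would likely be a better choice in practice.

```python
from collections import deque

# Shortened sample of the dictionary from the question (illustrative only).
codeCountry = {
    'AQ': 'ANTARCTICA',
    'AMENI': 'ARMENIA',
    'AURI': 'ARUBA',
    'AUR': 'ARGENTINA',
    'AURII': 'AUSTRALIA',
}


def build_automaton(keys):
    """Build the trie and failure links once, ahead of time."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # failure link of each state
    best = [None]  # longest key that ends at each state
    for key in keys:
        state = 0
        for ch in key:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                best.append(None)
            state = nxt
        best[state] = key
    # Breadth-first pass: states directly below the root keep fail = 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # If no key ends exactly here, inherit one from the failure link.
            if best[nxt] is None:
                best[nxt] = best[fail[nxt]]
            queue.append(nxt)
    return goto, fail, best


def longest_key_in(text, automaton):
    """Return the longest key occurring anywhere in `text`, or None."""
    goto, fail, best = automaton
    state = 0
    longest = None
    for ch in text:
        # Follow failure links until this character can be consumed.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found = best[state]
        if found is not None and (longest is None or len(found) > len(longest)):
            longest = found
    return longest


automaton = build_automaton(codeCountry)  # built once, reused per query
key = longest_key_in("AMAURI", automaton)
print(None if key is None else codeCountry[key])
```

The automaton only needs rebuilding when the dict changes, and each query then reads the input string in a single pass instead of generating every substring.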
