Python check if string contains any of a dictionary's keys

Given the example dictionary:

LANGUAGE_TO_ISO = {
    "en": "en",
    "eng": "en",
    "english": "en",
    "es": "es",
    "spanish": "es",
    ...
}

And the following example strings:

book_title = "The Dark Tower - english"
book_title = "The Dark Tower - eng"
book_title = "The Dark Tower 2 - english 2nd edition"

Is there a Python function I'm unaware of that would let me check whether a string contains any of the dictionary keys and then return the corresponding values, without having to loop over the ISO dictionary?

This way, I could extract the ISO language code from the many different ways a language could have been written.

If someone knows of a less dirty way of doing this, please share :)

UPDATE: As Willem mentioned, I forgot to specify that "english", "eng", "spanish", etc. would appear as separate words, delimited by a dot, comma, hyphen, space, ...

I don't know if this is the optimal way, and it still uses a loop, but it's pretty compact:

def has_key_in(dictionary, string):
  return any(k in string for k in dictionary)

The advantage, if I'm not wrong, is that any stops at the first True condition it encounters.

Now, the problem is that you don't have the corresponding value...
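One possible way to keep this compact and still get the value is to swap any for next over the dictionary's items. This is only a minimal sketch: first_value_in is a hypothetical helper name, the dictionary is the shortened one from the question, and the match is still a plain substring test, so "en" is also found inside a word like "Women".

```python
LANGUAGE_TO_ISO = {
    "en": "en",
    "eng": "en",
    "english": "en",
    "es": "es",
    "spanish": "es",
}


def first_value_in(dictionary, string):
    """Return the value of the first key found as a substring, else None."""
    # next() stops at the first match, just like any() does.
    return next((v for k, v in dictionary.items() if k in string), None)


print(first_value_in(LANGUAGE_TO_ISO, "The Dark Tower - english"))  # en
print(first_value_in(LANGUAGE_TO_ISO, "The Dark Tower"))  # None
```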

This should give you the keys common to the title and the dictionary:

set(book_title.split()).intersection(set(LANGUAGE_TO_ISO.keys()))

which you can then look up in the dictionary to get the corresponding value.
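As a rough sketch of that lookup step, assuming the shortened dictionary from the question (the variable names common_keys and iso_codes are made up for illustration):

```python
LANGUAGE_TO_ISO = {
    "en": "en",
    "eng": "en",
    "english": "en",
    "es": "es",
    "spanish": "es",
}

book_title = "The Dark Tower - english"

# intersection() accepts any iterable; iterating a dict yields its keys.
common_keys = set(book_title.split()).intersection(LANGUAGE_TO_ISO)

# Map each matched key to its ISO code; a set removes duplicate codes.
iso_codes = {LANGUAGE_TO_ISO[key] for key in common_keys}

print(iso_codes)  # {'en'}
```

Because split() only breaks on whitespace, a title such as "The Dark Tower,english" would not match with this approach.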


In response to a comment from the OP, here is a snippet of the output in an IPython shell:

In [4]: LANGUAGE_TO_ISO = { 
   ...:     "en": "en", 
   ...:     "eng": "en", 
   ...:     "english": "en", 
   ...:     "es": "es", 
   ...:     "spanish": "es", 
   ...: }                                                                                                                                                       

In [5]: book_title = "The Dark Tower - english"                                                                                                                 

In [6]: set(book_title.split()).intersection(set(LANGUAGE_TO_ISO.keys()))                                                                                       
Out[6]: {'english'}

A less complex way of doing it would be to use a regular expression to visit each word of the sentence and replace it through a replacement function, defaulting to the current word if it is not found in the dictionary:

LANGUAGE_TO_ISO = {
    "en": "en",
    "eng": "en",
    "english": "en",
    "es": "es",
    "spanish": "es",
}

book_title = "The Dark Tower - english"

import re

print(re.sub(r"\b(\w+)\b",lambda m : LANGUAGE_TO_ISO.get(m.group(1),m.group(1)),book_title))

prints:

The Dark Tower - en

If you are only interested in the whole words of the string, we can perform the match in time linear in the length of the string (each dictionary lookup takes constant time on average), with:

filter(None, map(LANGUAGE_TO_ISO.get, book_title.split()))

This yields the ISO codes of the words that match a key exactly, so 'en' is not matched inside the word 'men'. In Python 3, filter returns a lazy iterator, so wrap it in list() to see the results.

For example:

>>> book_title = "The Dark Tower - eng"
>>> list(filter(None, map(LANGUAGE_TO_ISO.get, book_title.split())))
['en']

We can, if we want to, make the match more or less case-insensitive with the following (it will not work for some special cases, for example characters without a lowercase variant):

filter(None, map(LANGUAGE_TO_ISO.get, book_title.lower().split()))

(given that the keys in the dictionary are all lowercase).

If, however, you want to be able to match substrings (like 'en' in 'men'), then you may want to look for a parser (a parser also works in time linear in the input, and acts like an annotated finite state machine).
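As a rough stand-in for such a parser, one could compile all keys into a single regex alternation. This is only a sketch under assumptions: KEY_PATTERN and iso_codes_in are hypothetical names, the dictionary is the shortened one from the question, and Python's re module is a backtracking engine rather than a true finite state machine, so linear time is not guaranteed.

```python
import re

LANGUAGE_TO_ISO = {
    "en": "en",
    "eng": "en",
    "english": "en",
    "es": "es",
    "spanish": "es",
}

# Longest keys first, so "english" is tried before "eng" and "en".
keys_by_length = sorted(LANGUAGE_TO_ISO, key=len, reverse=True)
KEY_PATTERN = re.compile("|".join(re.escape(key) for key in keys_by_length))


def iso_codes_in(text):
    """Return the ISO code for every key found anywhere in text, in order."""
    return [LANGUAGE_TO_ISO[match] for match in KEY_PATTERN.findall(text.lower())]


print(iso_codes_in("The Dark Tower - english"))  # ['en']
print(iso_codes_in("men"))  # ['en']
```

The second call shows the trade-off: substring matching also reports keys that merely occur inside other words.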

Called without arguments, str.split() splits on whitespace only, so dots, commas, etc. will not separate the words. You can, however, split on those with a regex, for example:

import re

splt = re.compile(r'\W+')  # raw string, so \W reaches the regex engine as-is

filter(None, map(LANGUAGE_TO_ISO.get, splt.split(book_title)))

Or, based on your edit ("Either a dot, comma, hyphen, space, ..."), you can list the separator characters between square brackets:

import re

splt = re.compile(r'[\s.,-]+')  # raw string, so \s reaches the regex engine as-is

filter(None, map(LANGUAGE_TO_ISO.get, splt.split(book_title)))
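Putting the pieces together, a small helper might look like the sketch below. The names SEPARATORS and extract_iso_codes are made up for illustration, the dictionary is the shortened one from the question, the separator set is only the one listed in the edit, and the last two titles are invented test strings.

```python
import re

LANGUAGE_TO_ISO = {
    "en": "en",
    "eng": "en",
    "english": "en",
    "es": "es",
    "spanish": "es",
}

# Split on runs of whitespace, dots, commas and hyphens.
SEPARATORS = re.compile(r"[\s.,-]+")


def extract_iso_codes(title):
    """Return the ISO codes of all whole words in title that are known keys."""
    # Lowercase first, since the dictionary keys are all lowercase.
    words = SEPARATORS.split(title.lower())
    return [LANGUAGE_TO_ISO[word] for word in words if word in LANGUAGE_TO_ISO]


print(extract_iso_codes("The Dark Tower 2 - english 2nd edition"))  # ['en']
print(extract_iso_codes("Don Quijote,Spanish."))  # ['es']
print(extract_iso_codes("Women in Black"))  # []
```

The list comprehension tests membership explicitly instead of using filter(None, ...), which also avoids dropping a legitimate but falsy value should the dictionary ever contain one.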
