Search for any word or combination of words from one string in a list (python)
I have a string (for example: "alpha beta charlie, delta&epsilon foxtrot") and a list (for example ["zero","omega virginia","apple beta charlie"]). Is there a convenient way to iterate through every word and combination of words in the string in order to search for it in the list?
You say combinations, but combinations are semantically unordered. What you mean is that you intend to find the intersection of all ordered permutations, joined by spaces, with a target list.
To begin with, we need to import the libraries we intend to use.
import re
import itertools
Don't split on characters; you're doing a semantic search for words, exclusive of strange characters. Regular expressions, powered by the re module, are perfect for this. In a raw Python string, r'', we use the regular expression for the edge of a word, \b, around any alphanumeric character (and _), \w, repeated one or more times, +.
re.findall returns a list of every match.
re_pattern = r'\b\w+\b'
silly_string = 'alpha beta charlie, delta&epsilon foxtrot'
words = re.findall(re_pattern, silly_string)
Here, words is our wordlist:
>>> print(words)
['alpha', 'beta', 'charlie', 'delta', 'epsilon', 'foxtrot']
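To see why the regular expression is preferred over plain whitespace splitting, here is a minimal sketch comparing the two on the example string. It assumes Python 3 syntax; the names split_words and regex_words are only illustrative.

```python
import re

silly_string = 'alpha beta charlie, delta&epsilon foxtrot'

# Splitting on whitespace keeps punctuation attached to the words
# and leaves 'delta&epsilon' as a single token.
split_words = silly_string.split()

# \b\w+\b matches runs of word characters only, so ',' and '&'
# act as separators.
regex_words = re.findall(r'\b\w+\b', silly_string)

print(split_words)  # ['alpha', 'beta', 'charlie,', 'delta&epsilon', 'foxtrot']
print(regex_words)  # ['alpha', 'beta', 'charlie', 'delta', 'epsilon', 'foxtrot']
```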
Continuing, we prefer to manipulate our data with generators to avoid unnecessarily materializing data before we need it and holding large datasets in memory. The itertools library has some nice functions that neatly suit our needs for providing all permutations of the above words and chaining them in a single iterable:
_gen = (itertools.permutations(words, i + 1) for i in range(len(words)))
all_permutations_gen = itertools.chain(*_gen)
Listing all_permutations_gen with list(all_permutations_gen) would give us:
[('alpha',), ('beta',), ('charlie',), ('delta',), ('epsilon',), ('foxtrot',), ('alpha', 'beta'), ('alpha', 'charlie'), ('alpha', 'delta'), ('alpha', 'epsilon'), ('alpha', 'foxtrot'), ('beta', 'alpha'), ('beta', 'charlie'), ('beta', 'delta'), ('beta', 'epsilon'), ('beta', 'foxtrot'), ('charlie', 'alpha'), ('charlie', 'beta'), ('charlie', 'delta'), ('charlie', 'epsilon'), ('charlie', 'foxtrot'), ('delta', 'alpha'), ('delta', 'beta'), ('delta', 'charlie'), ('delta', 'epsilon'), ('delta', 'foxtrot'), ('epsilon', 'alpha'), ('epsilon', 'beta'), ('epsilon', 'charlie'), ('epsilon', 'delta'), ('epsilon', 'foxtrot'), ('foxtrot', 'alpha'), ('foxtrot', 'beta'), ('foxtrot', 'charlie'), ('foxtrot', 'delta'), ('foxtrot', 'epsilon'), ('alpha', 'beta', 'charlie'), ('alpha', 'beta', 'delta'), ...
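It is worth knowing how many permutations this produces before materializing them. The sketch below (Python 3 syntax assumed; the variable names are illustrative) counts them two ways: with the formula n! / (n - k)! summed over every length k, and by walking the chained generator. For the six example words the total is 1956. The same formula gives 9,864,100 for ten words, so this approach is only practical for short strings.

```python
import itertools
import math

words = ['alpha', 'beta', 'charlie', 'delta', 'epsilon', 'foxtrot']
n = len(words)

# Ordered arrangements of k items out of n: n! / (n - k)!
expected = sum(
    math.factorial(n) // math.factorial(n - k) for k in range(1, n + 1)
)

# Count by actually consuming the chained generator.
gen = itertools.chain.from_iterable(
    itertools.permutations(words, k) for k in range(1, n + 1)
)
actual = sum(1 for _ in gen)

print(expected, actual)  # 1956 1956
```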
If we joined each permutation with spaces and materialized the generator in a list instead of a set, printing the first 20 items would show us:
>>> print(all_permutations[:20]) # this only works if you cast as a list instead
['alpha', 'beta', 'charlie', 'delta', 'epsilon', 'foxtrot', 'alpha beta', 'alpha charlie', 'alpha delta', 'alpha epsilon', 'alpha foxtrot', 'beta alpha', 'beta charlie', 'beta delta', 'beta epsilon', 'beta foxtrot', 'charlie alpha', 'charlie beta', 'charlie delta', 'charlie epsilon']
But that would exhaust the generator before we're ready. So instead, we now get the set of all permutations of those words:
all_permutations = set(' '.join(i) for i in all_permutations_gen)
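The remark about exhausting the generator can be seen in a small standalone sketch. It assumes Python 3 syntax and uses a shorter, three-word list to keep the numbers small. Once a generator has been consumed, a second pass yields nothing.

```python
import itertools

words = ['alpha', 'beta', 'charlie']
gen = itertools.chain.from_iterable(
    itertools.permutations(words, k) for k in range(1, len(words) + 1)
)

first_pass = list(gen)   # consumes every item: 3 + 6 + 6 tuples
second_pass = list(gen)  # the generator is now empty

print(len(first_pass), len(second_pass))  # 15 0
```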
With this, we can now search for an intersection with the target list:
>>> target_list = ["zero","omega virginia","apple beta charlie"]
>>> all_permutations.intersection(target_list)
set()
In this case, for the examples given, we get the empty set. But if the target list contains a string that is in our set of permutations:
>>> target_list_2 = ["apple beta charlie", "foxtrot alpha beta charlie"]
>>> all_permutations.intersection(target_list_2)
{'foxtrot alpha beta charlie'}
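For longer strings, building every permutation becomes expensive. As an alternative, and not the approach taken in this answer, one could test each target directly: a target appears in the permutation set exactly when its space-separated words can all be drawn from the source words without reusing any word more often than it occurs. A minimal sketch, assuming Python 3; the function name targets_made_of_words is hypothetical:

```python
import re
from collections import Counter


def targets_made_of_words(source, targets):
    """Return the targets that are space-joined arrangements of words in source."""
    available = Counter(re.findall(r'\b\w+\b', source))
    matches = set()
    for target in targets:
        # Split on single spaces to mirror ' '.join(...) in the answer.
        needed = Counter(target.split(' '))
        # Counter subtraction drops non-positive counts, so an empty
        # result means every needed word is available often enough.
        if not (needed - available):
            matches.add(target)
    return matches


silly_string = 'alpha beta charlie, delta&epsilon foxtrot'
result_1 = targets_made_of_words(
    silly_string, ["zero", "omega virginia", "apple beta charlie"])
result_2 = targets_made_of_words(
    silly_string, ["apple beta charlie", "foxtrot alpha beta charlie"])

print(result_1)  # set()
print(result_2)  # {'foxtrot alpha beta charlie'}
```

This does work proportional to the size of the target list rather than to the number of permutations, at the cost of no longer having the full permutation set available for other uses.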