If I have a list of words, how can I efficiently check that a string does not contain any of the words in the list?
As the title says, I have a list of words, like stopWords = ["the", "and", "with", etc...]
and I'm receiving text like "Kill the fox and dog". I want output like "Kill fox dog", as efficiently and quickly as possible. How can I do this? (I know I can iterate using a for loop, but that's not very efficient.)
The most important improvement is to make stopWords a set. This means the lookups will be very fast:
stopWords = set(["the", "and", "with", etc...])
" ".join(word for word in msg.split() if word not in stopWords)
If you just want to know whether any of the stopWords are in the text:
if any(word in stopWords for word in msg.split()):
...
With Python, the fastest approach is to make "stopwords" a set instead of a list and check membership directly with "x in stopwords". Sets are designed to make this sort of lookup fast.
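To see why the set matters, here is a quick `timeit` comparison (my own sketch, not from the answer; the names `stop_list` and `stop_set` are illustrative). A list membership test scans element by element (O(n)), while a set uses hashing (average O(1)):

```python
import timeit

# A moderately large stop-word collection, once as a list and once as a set.
stop_list = ["word%d" % i for i in range(1000)]
stop_set = set(stop_list)

# Worst case for the list: the probe word is absent, so every element is scanned.
list_time = timeit.timeit("'missing' in stop_list", globals=globals(), number=10000)
set_time = timeit.timeit("'missing' in stop_set", globals=globals(), number=10000)

print("list:", list_time)
print("set: ", set_time)  # expect the set lookup to be orders of magnitude faster
```

The gap grows with the size of the stop-word collection, which is why the conversion to a set pays off even though it costs a one-time pass over the list.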
Using a list comprehension:
stopWords = ["the", "and", "with"]
msg = "kill the fox and the dog"
' '.join([w for w in msg.split() if w not in stopWords])
gives:
'kill fox dog'
Have your stopWords in a set() (as others have suggested), accumulate your other words into a working set, then simply take the set difference using working = working - stopWords to get a working set with all of the stopWords filtered out. Or, just to check for the existence of such words, use a conditional. For example:
stopWords = set('the a an and'.split())
working = set('this is a test of the one working set dude'.split())
if working == working - stopWords:
    print("The working set contains no stop words")
else:
    print("Actually, it does")
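One caveat worth noting about the set-difference approach: converting the message to a set loses word order and duplicate words, so it suits existence checks rather than reconstructing the filtered sentence. A quick illustration (my own sketch, not from the original answer):

```python
stopWords = set("the a an and".split())
msg = "kill the fox and the dog"

# Set difference: order and duplicates are discarded.
remaining = set(msg.split()) - stopWords
print(remaining)  # {'kill', 'fox', 'dog'} in arbitrary order

# A generator filter keeps order and duplicates intact.
filtered = " ".join(w for w in msg.split() if w not in stopWords)
print(filtered)  # kill fox dog
```

So the set difference answers "does the text contain any stop words?", while the filter produces the cleaned text.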
There are actually more efficient data structures, such as a trie, which could be used for a large, relatively dense set of stop words. You can find trie modules for Python, though I didn't see any written as binary (C) extensions, and I wonder where the cross-over point would be between a trie implemented in pure Python vs. use of Python's set() support. (Might also be a good case for Cython, though.)
In fact I see that someone has tackled that question separately here on SO: How do I create a fixed length mutable array of python objects in cython.
Ultimately, of course, you should create the simple set-based version, test and profile it, and then, if necessary, try trie and Cython-trie variants as possible improvements.
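To make the trie idea above concrete, here is a minimal pure-Python sketch (my own illustration, not from the answer). For small stop-word lists the built-in set will almost certainly be faster; a trie mainly helps when you need prefix queries or want to share storage across a very large, dense vocabulary:

```python
class Trie:
    """A minimal trie supporting exact-word membership via `word in trie`."""

    def __init__(self):
        self.children = {}   # maps a character to a child Trie node
        self.is_word = False  # True if a complete word ends at this node

    def add(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def __contains__(self, word):
        node = self
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word


trie = Trie()
for w in ["the", "and", "with"]:
    trie.add(w)

print("and" in trie)  # True
print("an" in trie)   # False: "an" is only a prefix, not a stored word
print("fox" in trie)  # False
```

Note the `is_word` flag: without it, any prefix of a stored word ("an", "wit") would wrongly test as a member.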
As an alternative, you can assemble your list into a regex and replace stop words, along with their surrounding spaces, with a single space.
import re
stopWords = ["the", "and", "with"]
text = "Kill the fox and dog"
pattern = r"\s{:s}\s".format(r"\s|\s".join(stopWords))
print(pattern)
print(re.sub(pattern, " ", text))
will output:
\sthe\s|\sand\s|\swith\s
Kill fox dog
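One limitation of the \s-based pattern: it only matches stop words surrounded by whitespace, so a stop word at the very start or end of the string survives. A variant using \b word boundaries handles those cases (a sketch of mine, not from the answer; re.escape is added defensively on the assumption the stop words might someday contain regex metacharacters):

```python
import re

stopWords = ["the", "and", "with"]
# \b matches at word edges, so stop words at the string boundaries are caught too.
pattern = r"\b(?:{})\b".format("|".join(map(re.escape, stopWords)))

text = "the fox and the dog"
# Replace each stop word with a space, then normalise whitespace via split/join.
cleaned = " ".join(re.sub(pattern, " ", text).split())
print(cleaned)  # fox dog
```

The trailing split/join pass collapses the runs of spaces that substitution leaves behind, which the \s-based version handled by consuming the surrounding spaces itself.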