
If I have a list of words, how can I efficiently check that a string does not contain any of the words in the list?

As the title says, I have a list of words, like stopWords = ["the", "and", "with", etc...], and I'm receiving text like "Kill the fox and dog". I want output like "Kill fox dog", produced efficiently and fast. How can I do this? (I know I can iterate using a for loop, but that's not very efficient.)

The most important improvement is to make stopWords a set. This means the lookups will be very fast, because a set membership test takes constant time on average instead of scanning the whole list.

stopWords = {"the", "and", "with"}  # ...plus the rest of your stop words
msg = "Kill the fox and dog"
" ".join(word for word in msg.split() if word not in stopWords)

If you just want to know whether any of the stopWords are in the text:

if any(word in stopWords for word in msg.split()):
    ...
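
Both snippets above compare tokens exactly, so "The" or "and," would not be recognized as stop words. Here is a minimal sketch, assuming case-insensitive matching and stripping of surrounding punctuation are wanted; neither is stated in the question, and the helper names `remove_stop_words` and `has_stop_word` are made up for illustration:

```python
import string

STOP_WORDS = {"the", "and", "with"}  # assumed sample; substitute your own list


def _normalize(token):
    # Assumption: "The" and "and," should count as the stop words "the" and "and".
    return token.strip(string.punctuation).lower()


def remove_stop_words(msg, stop_words=STOP_WORDS):
    # Keep the original token (case and punctuation) when it is not a stop word.
    return " ".join(t for t in msg.split() if _normalize(t) not in stop_words)


def has_stop_word(msg, stop_words=STOP_WORDS):
    # any() stops at the first stop word it finds.
    return any(_normalize(t) in stop_words for t in msg.split())
```

The stop words themselves are assumed to be stored in lowercase; if they are not, lowercase them once when building the set.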

With Python, the fastest approach is to make "stopwords" a set instead of a list and check membership directly with "x in stopwords". Sets are designed to be fast for this sort of operation.

See the set documentation.
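
To check the claim on your own machine, a rough timing sketch with `timeit` might look like the following. The 1000-word list is synthetic and the probe word is deliberately absent, which is the worst case for the list because every element has to be compared:

```python
import timeit

# Synthetic data, purely for illustration.
words = ["word%d" % i for i in range(1000)]
as_list = list(words)
as_set = set(words)
probe = "not-a-stop-word"  # absent, so the list scan touches every element

list_time = timeit.timeit(lambda: probe in as_list, number=2000)
set_time = timeit.timeit(lambda: probe in as_set, number=2000)

print("list: %.4fs  set: %.4fs" % (list_time, set_time))
```

The absolute numbers depend on the machine and Python version, but the list lookup grows with the number of stop words while the set lookup stays roughly constant.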

Using a list comprehension:

stopWords = ["the", "and", "with"]
msg = "kill the fox and the dog"

' '.join([w for w in msg.split() if w not in stopWords])

gives:

'kill fox dog'

  1. Put your original list of words in a dictionary.
  2. Iterate through the characters in the given string, using space as a delimiter for a word. Look up each word in the dictionary.
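
The two steps above can be sketched as follows. The function name `filter_by_scanning` is hypothetical, and only the space character is treated as a delimiter, as the answer describes:

```python
def filter_by_scanning(text, stop_words):
    # Step 1: put the words in a dictionary so each lookup is a hash lookup.
    lookup = dict.fromkeys(stop_words, True)

    kept = []
    current = []  # characters of the word being read
    # Step 2: walk the characters; the extra trailing space flushes the last word.
    for ch in text + " ":
        if ch == " ":
            if current:
                word = "".join(current)
                if word not in lookup:
                    kept.append(word)
                current = []
        else:
            current.append(ch)
    return " ".join(kept)
```

In practice `text.split()` does the same scan in C, so this hand-written loop is mainly useful to show the idea; a set would also serve in place of the dictionary.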

Keep your stopWords in a set() (as others have suggested), accumulate your other words into a working set, then simply take the set difference using working = working - stopWords to get a working set with all of the stopWords filtered out. Or, just to check for the existence of such words, use a conditional. For example:

stopWords = set('the a an and'.split())
working   = set('this is a test of the one working set dude'.split())
# If removing the stop words changes nothing, none were present.
if working == working - stopWords:
    print("The working set contains no stop words")
else:
    print("Actually, it does")
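
One caveat: a set keeps neither word order nor duplicates, so the set difference cannot rebuild the sentence "Kill fox dog" from the question. A short sketch of related set operations that avoid building the difference, next to an order-preserving filter (the variable names are illustrative only):

```python
stop_words = set("the a an and".split())
words = "this is a test of the one working set dude".split()

# isdisjoint() answers "are there no stop words at all?" without building a new set.
contains_none = stop_words.isdisjoint(words)

# The intersection tells you which stop words actually occurred.
found = stop_words & set(words)

# Filtering the list instead of a set keeps order and repeated words.
filtered = [w for w in words if w not in stop_words]
```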

There are actually more efficient data structures, such as a trie, which could be used for a large, relatively dense set of stop words. You can find trie modules for Python, though I didn't see any written as binary (C) extensions, and I wonder where the cross-over point would be between a trie implemented in pure Python and Python's built-in set() support. (It might also be a good case for Cython, though.)

In fact, I see that someone has tackled that question separately on SO: "How do I create a fixed length mutable array of python objects in cython".

Ultimately, of course, you should create the simple set-based version, test and profile it, and then, if necessary, try trie and Cython-trie variants as possible improvements.
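
For comparison against the set-based version, a pure-Python trie can be sketched with nested dictionaries. This is only an illustration of the data structure, not one of the existing trie modules mentioned above, and the names `build_trie` and `in_trie` are made up:

```python
END = object()  # sentinel key marking "a stored word ends here"


def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            # Descend, creating child nodes as needed.
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def in_trie(trie, word):
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False  # no stored word has this prefix
    return END in node  # a prefix alone ("th") is not a match


trie = build_trie(["the", "and", "with"])
msg = "Kill the fox and dog"
result = " ".join(w for w in msg.split() if not in_trie(trie, w))
```

Each lookup here costs one dictionary access per character, all in interpreted Python, so it would be worth timing against `word in stop_set` before adopting it.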

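As an alternative, you can assemble your list into a regex and replace each stop word, along with its surrounding whitespace, with a single space. Note that this pattern needs whitespace on both sides of a stop word, so it misses one at the very start or end of the string, as well as the second of two adjacent stop words.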

import re
stopWords = ["the", "and", "with"]
text = "Kill the fox and dog"  # renamed from "input" to avoid shadowing the built-in
pattern = r"\s{:s}\s".format(r"\s|\s".join(stopWords))
print(pattern)
print(re.sub(pattern, " ", text))

will output:

\sthe\s|\sand\s|\swith\s
Kill fox dog
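
A variant that also handles stop words at the start or end of the string and adjacent stop words could use word boundaries (`\b`) instead of `\s`. This is a sketch, not the answer's original pattern; `re.IGNORECASE` and the helper name `strip_stop_words` are assumptions added for illustration:

```python
import re

stop_words = ["the", "and", "with"]

# re.escape guards against stop words containing regex metacharacters;
# \b matches at word boundaries without consuming the neighbouring spaces.
pattern = re.compile(
    r"\b(?:{})\b".format("|".join(map(re.escape, stop_words))),
    re.IGNORECASE,  # assumption: "The" should be removed as well
)


def strip_stop_words(text):
    without = pattern.sub(" ", text)
    # Collapse the leftover runs of whitespace into single spaces.
    return " ".join(without.split())
```

Because `\b` also matches next to punctuation, a token such as "the-fox" would lose its "the"; whether that is desirable depends on the text being processed.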

