Remove a list of words from a string

Question

I have a list of stopwords and a search string. I want to remove the stopwords from the string.

As an example:

stopwords=['what','who','is','a','at','is','he']
query='What is hello'

Now the code should strip out 'what' and 'is'. However, in my case it strips out 'a' as well as 'at'. I have given my code below. What could I be doing wrong?

for word in stopwords:
    if word in query:
        print(word)
        query=query.replace(word,"")

If the input query is "What is hello", I get the output:
wht s llo

Why does this happen?

Answer 1

This is one way to do it:

query = 'What is hello'
stopwords = ['what', 'who', 'is', 'a', 'at', 'is', 'he']
querywords = query.split()

resultwords  = [word for word in querywords if word.lower() not in stopwords]
result = ' '.join(resultwords)

print(result)

I noticed that you also want to remove a word when its lowercase variant is in the list, so I added a call to lower() in the condition check.
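The same steps can be wrapped in a small helper so they can be reused across queries. This is only a sketch: the name remove_stopwords is made up for illustration and is not part of the original answer, and it assumes the stopwords are stored in lowercase.

```python
def remove_stopwords(query, stopwords):
    """Return query with every whitespace-separated stopword dropped.

    Hypothetical helper (not from the original answer). Each word is
    lowercased before the lookup, so stopwords should be lowercase.
    """
    kept = [word for word in query.split() if word.lower() not in stopwords]
    return ' '.join(kept)


stopwords = ['what', 'who', 'is', 'a', 'at', 'is', 'he']
print(remove_stopwords('What is hello', stopwords))  # prints: hello
```

Because the query is split into whole words first, 'he' no longer matches inside 'hello'.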

Answer 2

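The accepted answer works when the words are separated by spaces, but that is not the case in real life, where punctuation can also separate words. In that case re.split is required.

Also, testing against stopwords as a set makes the lookup faster (although with only a few words the cost of hashing each string can offset the faster lookup).

My proposal: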
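If the stopwords already live in a list, converting them once up front is enough. A minimal sketch, with illustrative variable names that are not in the original answer:

```python
stopword_list = ['what', 'who', 'is', 'a', 'at', 'is', 'he']

# Build the set once; the duplicate 'is' collapses into a single entry.
stopword_set = set(stopword_list)

print(len(stopword_list))       # 7
print(len(stopword_set))        # 6
print('is' in stopword_set)     # True
print('hello' in stopword_set)  # False
```

A membership test on a list compares against the elements one by one, while a set uses a hash lookup, which is where the speed-up for larger stopword collections comes from.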

import re

query = 'What is hello? Says Who?'
stopwords = {'what','who','is','a','at','is','he'}

resultwords = [word for word in re.split(r"\W+", query) if word.lower() not in stopwords]
print(resultwords)

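Output (as a list of words):

['hello', 'Says', '']

There is an empty string at the end because re.split annoyingly emits empty fields, which need to be filtered out. There are two solutions: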

resultwords = [word for word in re.split(r"\W+", query) if word and word.lower() not in stopwords]  # filter out empty words

or add the empty string to the set of stopwords :)

stopwords = {'what','who','is','a','at','is','he',''}

Now the code prints:

['hello', 'Says']
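Another way to sidestep the empty field, not taken from the original answer, is to match the words with re.findall instead of splitting on the non-word characters. A sketch under that assumption:

```python
import re

query = 'What is hello? Says Who?'
stopwords = {'what', 'who', 'is', 'a', 'at', 'he'}

# re.findall returns only the runs of word characters, so no empty
# strings show up at the start or end of the list.
words = re.findall(r"\w+", query)
resultwords = [word for word in words if word.lower() not in stopwords]
print(resultwords)  # ['hello', 'Says']
```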

Answer 3

Building on what karthikr said, try

' '.join(filter(lambda x: x.lower() not in stopwords,  query.split()))

Explanation:

query.split()  # splits query on whitespace, e.g. "What is hello" -> ["What", "is", "hello"]

filter(func, iterable)  # takes a function and an iterable (list, string, etc.) and
                        # keeps only the items for which the function, called on one
                        # item at a time, returns a true value

lambda x: x.lower() not in stopwords   # anonymous function that takes a word,
                                       # converts it to lower case, and returns True if
                                       # the word is not in the iterable stopwords


' '.join(iterable)  # joins all items of the iterable (items must be strings)
                    # using the string in front of the dot, i.e. ' ', as the joiner,
                    # e.g. ["What", "is", "hello"] -> "What is hello"
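To see the three pieces working together, the one-liner can be unrolled into separate steps. This is just an illustrative sketch using the question's data; the intermediate variable names are not part of the original answer.

```python
stopwords = ['what', 'who', 'is', 'a', 'at', 'is', 'he']
query = 'What is hello'

words = query.split()
print(words)  # ['What', 'is', 'hello']

# In Python 3, filter() returns a lazy iterator, so list() is used
# here only to make the intermediate result visible.
kept = list(filter(lambda x: x.lower() not in stopwords, words))
print(kept)  # ['hello']

result = ' '.join(kept)
print(result)  # hello
```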

Answer 4

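Looking at the other answers to your question, I noticed that they told you how to do what you are trying to do, but they did not answer the question you posed at the end.

If the input query is "What is hello", I get the output:

wht s llo

Why does this happen?

This happens because .replace() replaces every occurrence of the exact substring you give it, wherever it appears, including inside other words.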

For example:

"My, my! Hello my friendly mystery".replace("my", "")

gives:

>>> "My, ! Hello  friendly stery"
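Applied to the loop from the question, the same effect can be traced one stopword at a time. The sketch below reuses the question's variables and only adds a print of each intermediate value:

```python
stopwords = ['what', 'who', 'is', 'a', 'at', 'is', 'he']
query = 'What is hello'

for word in stopwords:
    if word in query:  # substring test, not a whole-word test
        query = query.replace(word, "")
        print(repr(word), '->', repr(query))

# 'is' -> 'What  hello'
# 'a' -> 'Wht  hello'
# 'he' -> 'Wht  llo'
```

The lowercase 'what' never matches the capitalised 'What', while 'a' is removed from inside 'What' and 'he' from inside 'hello'.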

.replace() essentially splits the string on the substring given as the first argument and joins the pieces back together with the second argument.

"hello".replace("he", "je")

is logically similar to:

"je".join("hello".split("he"))

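The analogy can be checked directly. A small sketch; the variable names are illustrative:

```python
text = "hello"

replaced = text.replace("he", "je")
rejoined = "je".join(text.split("he"))

print(replaced)              # jello
print(replaced == rejoined)  # True
```

The comparison only holds for a non-empty substring: str.split raises ValueError for an empty separator, whereas str.replace accepts an empty first argument.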
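If you still wanted to use .replace to remove whole words, you might think that adding a space before and after the word would be enough, but this misses words at the beginning and end of the string as well as occurrences that are next to punctuation.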

"My, my! hello my friendly mystery".replace(" my ", " ")
>>> "My, my! hello friendly mystery"

"My, my! hello my friendly mystery".replace(" my", "")
>>> "My,! hello friendlystery"

"My, my! hello my friendly mystery".replace("my ", "")
>>> "My, my! hello friendly mystery"

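Additionally, adding spaces before and after will not catch consecutive duplicates. .replace() scans left to right for non-overlapping matches, so the trailing space of the first match has already been consumed and the second " my " has no leading space left to match: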

"hello my my friend".replace(" my ", " ")
>>> "hello my friend"

For these reasons, the accepted answer by Robby Cornelissen is the recommended way to do what you want.
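If a substitution-based approach is still preferred, one option that none of the answers here use is a regular expression with \b word boundaries. The following is a sketch only: remove_whole_words is a hypothetical helper, and it assumes that "word" means a run of letters, digits, and underscores, which is how \b defines it.

```python
import re


def remove_whole_words(text, words):
    """Hypothetical helper: delete whole-word, case-insensitive matches."""
    # re.escape guards against regex metacharacters inside a stopword.
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    stripped = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Collapse the runs of whitespace left behind by the removed words.
    return " ".join(stripped.split())


stopwords = ['what', 'who', 'is', 'a', 'at', 'he']
print(remove_whole_words('What is hello', stopwords))       # hello
print(remove_whole_words('hello my my friend', ['my']))     # hello friend
```

Unlike the " my " trick, this handles words at either end of the string and consecutive duplicates, but punctuation attached to a removed word stays in the output.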

Answer 5

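stopwords = ['for', 'or', 'to']
p = 'Asking for help, clarification, or responding to other answers.'
# Caution: str.replace works on substrings, so a stopword is also removed
# from inside longer words (e.g. 'or' inside 'words').
for word in stopwords:
    p = p.replace(word, '')
print(p)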

Answer 6

" ".join([x for x in query.split() if x not in stopwords])

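Note that this version compares each word exactly as written, so with the question's data the capitalised 'What' survives. A quick check, with illustrative variable names, next to the lowercased variant from the accepted answer:

```python
stopwords = ['what', 'who', 'is', 'a', 'at', 'is', 'he']
query = 'What is hello'

case_sensitive = " ".join([x for x in query.split() if x not in stopwords])
case_insensitive = " ".join([x for x in query.split() if x.lower() not in stopwords])

print(case_sensitive)    # What hello
print(case_insensitive)  # hello
```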