
Removing a list of words from a string

I have a list of stopwords and a search string. I want to remove the stopwords from the string.

As an example:

stopwords=['what','who','is','a','at','is','he']
query='What is hello'

Now the code should strip 'What' and 'is'. However, in my case it also strips 'a' and 'at'. My code is below. What could I be doing wrong?

for word in stopwords:
    if word in query:
        print word
        query=query.replace(word,"")

If the input query is "What is Hello", I get the output as:
wht s llo

Why does this happen?

This is one way to do it:

query = 'What is hello'
stopwords = ['what', 'who', 'is', 'a', 'at', 'is', 'he']
querywords = query.split()

resultwords  = [word for word in querywords if word.lower() not in stopwords]
result = ' '.join(resultwords)

print(result)

I noticed that you also want to remove a word if its lower-case variant is in the list, so I've added a call to lower() in the condition check.
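
The same idea can be wrapped in a small reusable helper. This is only a minimal sketch: the function name remove_stopwords is hypothetical (it does not come from the answer), and it assumes the stopwords are already lower-case.

```python
def remove_stopwords(query, stopwords):
    """Drop every whitespace-separated word whose lower-case form is a stopword.

    Assumes the entries in `stopwords` are already lower-case.
    """
    kept = [word for word in query.split() if word.lower() not in stopwords]
    return ' '.join(kept)


stopwords = ['what', 'who', 'is', 'a', 'at', 'is', 'he']
result = remove_stopwords('What is hello', stopwords)
print(result)  # hello
```

Because whole words are compared, 'a' and 'he' no longer match inside 'What' or 'hello'.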

The accepted answer works when the input is a list of words separated by spaces, but real-life text often uses punctuation to separate words. In that case re.split is required.

Also, testing against stopwords as a set makes lookup faster (although with only a few words, the cost of hashing each string may offset the gain).
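
To check the list-versus-set tradeoff on your own machine, a rough timing sketch along these lines may help. The variable names are illustrative, and the numbers will vary with the Python version and the size of the stopword collection, so no particular result is claimed here.

```python
import timeit

stopwords_list = ['what', 'who', 'is', 'a', 'at', 'is', 'he']
stopwords_set = set(stopwords_list)  # the duplicate 'is' collapses into one entry

# Membership gives the same answer either way; only the lookup cost differs.
assert ('is' in stopwords_list) == ('is' in stopwords_set)

# Time repeated lookups of a word that is NOT a stopword (worst case for the list).
list_time = timeit.timeit(lambda: 'hello' in stopwords_list, number=100_000)
set_time = timeit.timeit(lambda: 'hello' in stopwords_set, number=100_000)
print(f'list: {list_time:.4f}s  set: {set_time:.4f}s')
```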

My proposal:

import re

query = 'What is hello? Says Who?'
stopwords = {'what','who','is','a','at','is','he'}

resultwords  = [word for word in re.split(r"\W+",query) if word.lower() not in stopwords]
print(resultwords)

output (as list of words):

['hello','Says','']

There's an empty string at the end because re.split emits an empty field when the string ends with a separator; it needs filtering out. Two solutions:

resultwords  = [word for word in re.split(r"\W+",query) if word and word.lower() not in stopwords]  # filter out empty words

or add the empty string to the set of stopwords :)

stopwords = {'what','who','is','a','at','is','he',''}

now the code prints:

['hello','Says']
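
As a possible third option (not part of the original proposal), re.findall can collect the words directly instead of splitting on the separators. It returns only the matched runs of word characters, so no empty strings appear in the first place. A minimal sketch:

```python
import re

query = 'What is hello? Says Who?'
stopwords = {'what', 'who', 'is', 'a', 'at', 'he'}

# findall returns each run of word characters; separators are simply skipped,
# so there is no trailing empty field to filter out.
words = re.findall(r'\w+', query)
resultwords = [word for word in words if word.lower() not in stopwords]
print(resultwords)  # ['hello', 'Says']
```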

Building on what karthikr said, try

' '.join(filter(lambda x: x.lower() not in stopwords,  query.split()))

explanation:

query.split() # splits the variable query on whitespace, i.e. "What is hello" -> ["What","is","hello"]

filter(func,iterable) # takes in a function and an iterable (list/string/etc.) and
                      # keeps only the items for which the function, called on one
                      # item at a time, returns True

lambda x: x.lower() not in stopwords   # anonymous function that takes in variable,
                                       # converts it to lower case, and returns true if
                                       # the word is not in the iterable stopwords


' '.join(iterable) #joins all items of the iterable (items must be strings/chars)
                   #using the string/char in front of the dot, i.e. ' ' as a joiner.
                   # i.e. ["What", "is","hello"] -> "What is hello"
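
For comparison, the filter/lambda pipeline can also be written as a generator expression, which some readers find easier to follow. This sketch reuses the question's stopwords and query and is only meant to show that both forms produce the same string:

```python
stopwords = ['what', 'who', 'is', 'a', 'at', 'is', 'he']
query = 'What is hello'

# filter() with a lambda, as in the one-liner above
with_filter = ' '.join(filter(lambda x: x.lower() not in stopwords, query.split()))

# the same test expressed as a generator expression
with_genexp = ' '.join(x for x in query.split() if x.lower() not in stopwords)

print(with_filter)  # hello
print(with_genexp)  # hello
```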

Looking at the other answers to your question, I noticed that they told you how to do what you are trying to do, but they did not answer the question you posed at the end.

If the input query is "What is Hello", I get the output as:

wht s llo

Why does this happen?

This happens because .replace() replaces every occurrence of the exact substring you give it, wherever it appears, including inside longer words.

for example:

"My, my! Hello my friendly mystery".replace("my", "")

gives:

>>> "My, ! Hello  friendly stery"

.replace() is essentially splitting the string by the substring given as the first parameter and joining it back together with the second parameter.

"hello".replace("he", "je")

is logically similar to:

"je".join("hello".split("he"))

If you still wanted to use .replace to remove whole words, you might think adding a space before and after would be enough, but this misses words at the beginning and end of the string as well as occurrences next to punctuation.

"My, my! hello my friendly mystery".replace(" my ", " ")
>>> "My, my! hello friendly mystery"

"My, my! hello my friendly mystery".replace(" my", "")
>>> "My,! hello friendlystery"

"My, my! hello my friendly mystery".replace("my ", "")
>>> "My, my! hello friendly mystery"

Additionally, adding spaces before and after will not catch adjacent duplicates: the space following the first match is consumed by that match, so the second occurrence no longer has a leading space to match on:

"hello my my friend".replace(" my ", " ")
>>> "hello my friend"
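
If a substitution-based approach is still preferred, one alternative to padding with spaces is a regular expression with \b word boundaries. This is a swapped-in technique, not something the answer proposes, and the helper name strip_stopwords is hypothetical. A minimal sketch:

```python
import re


def strip_stopwords(text, stopwords):
    # \b only matches at a word boundary, so 'my' cannot match inside 'mystery'.
    # re.escape guards against stopwords that contain regex metacharacters.
    pattern = r'\b(?:' + '|'.join(map(re.escape, stopwords)) + r')\b'
    without = re.sub(pattern, '', text, flags=re.IGNORECASE)
    # Removing a word leaves its surrounding spaces behind; collapse them.
    return ' '.join(without.split())


print(strip_stopwords('hello my my friend', ['my']))  # hello friend
print(strip_stopwords('What is hello', ['what', 'who', 'is', 'a', 'at', 'he']))  # hello
```

Word boundaries handle the start and end of the string and adjacent duplicates, but any punctuation that was next to a removed word stays in the result.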

For these reasons, the accepted answer by Robby Cornelissen is the recommended way to do what you want.

stopwords = ['for', 'or', 'to']
p = 'Asking for help, clarification, or responding to other answers.'
for i in stopwords:
    # replace() matches substrings, so a stopword inside a longer word is removed too
    p = p.replace(i, '')
print(p)

" ".join([x for x in query.split() if x not in stopwords])

