Python - Remove lines containing a specific string, but not lines where it is only part of another word

Question

I have two text files. One contains sentences and the other contains bad words. I want to find every sentence that contains a word from the bad-word list and remove that line (the whole sentence), but only when the bad word stands alone, not when it is part of another word. For example, I want to remove lines containing "on" but not lines containing "onsite". Any advice?

#bad_words = ["on", "off"]
#sentences = ["Learning Python is an ongoing task", "I practice on and off", "I do it offline", "On weekdays i practice the most", "In weekends I am off"]

def clean_sentences(sentences,bad_words, outfile, badfile):
    bad_words_list = []
    with open(bad_words) as wo:
        bad_words_list=wo.readlines()
        b_lists=list(map(str.strip, bad_words_list))
        for line in b_lists:
            line=line.strip('\n')
            line=line.lower()            
            bad_words_list.insert(len(bad_words_list),line)
    with open(sentences) as oldfile, open(outfile, 'w') as newfile, open(badfile, 'w') as badwords:
        for line in oldfile:
            if not any(bad_word in line for bad_word in bad_words):
                newfile.write(line)
            else:
                badwords.write(line)

clean_sentences('sentences.txt', 'bad_words.txt', 'outfile.txt', 'badfile.txt')

Answer 1

Instead of checking whether any of the bad words is in the sentence, check whether any of the bad words is in the split of the sentence. That way a bad word matches only when it is a separate word in the sentence, not an arbitrary substring of it.

Here is a simplified version of your code (without the file handling):

bad_words = ["on", "off"]
sentences = ["Learning Python is an ongoing task", "I practice on and off", "I do it offline", "On weekdays i practice the most", "In weekends I am off"]

def clean_sentences(sentences, bad_words):
    for sentence in sentences:
        if any(word in map(lambda str: str.lower(), sentence.split()) for word in bad_words):
            print(f'Found bad word in {sentence}')

clean_sentences(sentences, bad_words)

# output
Found bad word in I practice on and off
Found bad word in On weekdays i practice the most
Found bad word in In weekends I am off

With regard to your own code, just update

            if not any(bad_word in line for bad_word in bad_words):
                newfile.write(line)

to

            if not any(bad_word in map(lambda str: str.lower(), line.split()) for bad_word in bad_words):
                newfile.write(line)
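
Note that in the question's function, `bad_words` is the path to the word file, while the stripped, lower-cased words are collected in `bad_words_list`, so the check presumably needs to iterate over the loaded words rather than the path. A minimal sketch of the whole function under that assumption (the names `bad_word_set`, `words` and `badlines` are illustrative, not from the original code):

```python
def clean_sentences(sentences, bad_words, outfile, badfile):
    # Load the bad words: one per line, stripped and lower-cased.
    with open(bad_words) as wo:
        bad_word_set = {line.strip().lower() for line in wo if line.strip()}

    with open(sentences) as oldfile, \
         open(outfile, 'w') as newfile, \
         open(badfile, 'w') as badlines:
        for line in oldfile:
            # Compare whole whitespace-separated words, not substrings.
            words = {word.lower() for word in line.split()}
            if words & bad_word_set:
                badlines.write(line)   # line contains a stand-alone bad word
            else:
                newfile.write(line)    # line is clean
```

Using sets makes the per-line check a single intersection; the behavior is otherwise the same as the `any(...)` test over `line.split()`.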

EDIT: to make the search case-insensitive, use the lower-case version of the words in the sentence (assuming the bad words are themselves lower case). I've updated the code with a map and a simple lambda function.
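
One limitation of the split-based check is that a word with punctuation attached, such as "off." at the end of a sentence, is not equal to "off" and so is not matched. A possible alternative, not part of the original answer, is a regular expression with `\b` word boundaries and `re.IGNORECASE`. The helper name `contains_bad_word` below is hypothetical:

```python
import re

def contains_bad_word(sentence, bad_words):
    # With no bad words there is nothing to match.
    if not bad_words:
        return False
    # \b matches at word boundaries, so "on" matches "on," but not "ongoing".
    # re.escape guards against bad words that contain regex metacharacters.
    pattern = r'\b(?:' + '|'.join(map(re.escape, bad_words)) + r')\b'
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None

bad_words = ["on", "off"]
sentences = ["I practice on and off.", "I do it offline", "Learning Python is an ongoing task"]
for sentence in sentences:
    print(sentence, '->', contains_bad_word(sentence, bad_words))
```

Be aware that `\b` also treats a hyphen as a boundary, so "on-site" would count as containing "on"; whether that is desirable depends on the word list.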

Solution 1
1 accepted 2020-04-09 09:52:36
