How to remove a set of words and their variants (or inflections) from a text file?

Question

I am trying to use Python to remove the lines of a text file that contain certain words or their variants (I'm not sure that is the right term).

What I mean by variants:

"Yay","yay",'“Yay','Yay”',"Yay;","Yay?","Yay’s","Yay's",'Yay!','Yay.',"Yay”;"

So I tried doing it manually with the following code:

infile1 = open("file1.txt",'r')
outfile1 = open("file2.txt",'w')

word_list = ["Yay","yay",'“Yay','Yay”',"Yay;","Yay?","Yay’s","Yay's",'Yay!','Yay.',"Yay”;"]

for line in infile1:
    tempList = line.split()
    if any((el in tempList for el in word_list)):
        continue
    else:
        outfile1.write(line)

It didn't work well: some of the words listed in word_list were still present in the output file. There are also many more word variants to consider (like God, God!, book, Book, books, books?, etc.).

I was wondering whether there is a more efficient way to do this (perhaps with regular expressions).

EDIT 1:

Input: Sample.txt:

I want my book.

I need my books.

Why you need a book?

Let's go read.

Coming to library

I need to remove all the lines containing "book.", "books." or "book?" from my Sample.txt file.

Output: Fixed.txt:

Let's go read.

Coming to library

NOTE: The original corpus has around 60,000 lines.

Answer 1

You can set a flag for every line and keep or drop the line based on the flag value, something like this:

input_sample = [
    "I want my book.",
    "I need my books.",
    "Why you need a book?",
    "Let's go read.",
    "Coming to library"
]
words = ['book']
result = []
for line in input_sample:
    flag = 0    # set to 1 as soon as any word is found in this line
    for word in words:
        # lowercase both sides so that matching is case-insensitive
        if word.lower() in line.lower():
            flag = 1
            break    # no need to check the remaining words for this line
    if flag == 0:
        result.append(line)    # keep the line only when no word matched

print(result)

This results in:

["Let's go read.", 'Coming to library']
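Note that a plain substring check such as `word in line` also matches longer words that merely contain the target (for example "bookshelf" or "facebook"). If that is not wanted, a regular expression with word boundaries may be a better fit, as the question suggests. The following is only a sketch under assumptions: the helper names `build_pattern`, `filter_lines` and `filter_file`, and the list of suffixes treated as "variants", are hypothetical and not part of the original answer, so adjust them to your corpus.

```python
import re


def build_pattern(words, suffixes=("s", "es", "'s", "’s")):
    """Compile a case-insensitive regex that matches any of `words` as a
    whole word, optionally followed by one of the assumed suffixes."""
    if not words:
        raise ValueError("words must not be empty")
    word_alt = "|".join(re.escape(w) for w in words)
    suffix_alt = "|".join(re.escape(s) for s in suffixes)
    # \b keeps "book" from matching inside "bookshelf" or "facebook";
    # surrounding punctuation and quotes are non-word characters, so
    # "book.", "book?" and "“Yay" still match.
    return re.compile(rf"\b(?:{word_alt})(?:{suffix_alt})?\b", re.IGNORECASE)


def filter_lines(lines, pattern):
    """Yield only the lines in which the pattern does not occur."""
    return (line for line in lines if not pattern.search(line))


def filter_file(src_path, dst_path, words):
    """Copy src_path to dst_path line by line, skipping matching lines."""
    pattern = build_pattern(words)
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(filter_lines(src, pattern))


input_sample = [
    "I want my book.",
    "I need my books.",
    "Why you need a book?",
    "Let's go read.",
    "Coming to library",
]
print(list(filter_lines(input_sample, build_pattern(["book"]))))
```

For the sample above this prints the same two lines as before. For the real corpus, something like `filter_file("Sample.txt", "Fixed.txt", ["book", "Yay", "God"])` reads the file one line at a time and compiles the pattern only once, so around 60,000 lines should not be a problem.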

Answer 1: 2 · accepted · 2017-04-04 16:45:21
