How do I remove a set of words and their variants (or inflections) from a text file?

Question

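I am trying to use Python to remove lines from a text file that contain certain words and their variants (I am not sure that is the right term).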

By variants I mean:

"Yay","yay",'“Yay','Yay”',"Yay;","Yay?","Yay’s","Yay's",'Yay!','Yay.',"Yay”;"

So I tried listing the variants by hand, with the following code:

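word_list = ["Yay","yay",'“Yay','Yay”',"Yay;","Yay?","Yay’s","Yay's",'Yay!','Yay.',"Yay”;"]

# "with" closes both files, so everything written is flushed to disk
with open("file1.txt", 'r') as infile1, open("file2.txt", 'w') as outfile1:
    for line in infile1:
        tempList = line.split()
        # skip the line if any token exactly equals one of the listed variants
        if any(el in tempList for el in word_list):
            continue
        outfile1.write(line)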

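This does not work well: some of the words listed in word_list still show up in the output file. There are also many more word variants to consider (for example God, God!, book, books, and so on).

I would like to know whether there is a more efficient way to do this (perhaps using regular expressions!).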

Edit 1:

Input: Sample.txt:

I want my book.

I need my books.

Why you need a book?

Let's go read.

Coming to library

I need to remove every line containing "book.", "books." or "book?" from my Sample.txt file.

Output: Fixed.txt:

Let's go read.

Coming to library

Note: the original corpus has about 60,000 lines.

Answer 1

You can set a flag for each line and keep or drop the line based on the flag's value, like this:

input_sample = [
    "I want my book.",
    "I need my books.",
    "Why you need a book?",
    "Let's go read.",
    "Coming to library"
]
words = ['book']
result = []
for line in input_sample:
    flag = 0    # becomes 1 as soon as one of the words is found in this line
    for word in words:
        # lowercase both sides so the match is case-insensitive;
        # this is a substring test, so 'book' also matches 'books' and 'book?'
        if word.lower() in line.lower():
            flag = 1
            break    # no need to check the remaining words for this line
    if flag == 0:
        result.append(line)    # keep only the lines where no word matched

print(result)

The result is:

["Let's go read.", 'Coming to library']
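Since the question mentions regular expressions, here is a rough sketch of that route. It is not part of the approach above, just one way it might be done. The substring test also drops lines containing words such as "notebook" or "booking"; a pattern with word boundaries avoids that while still covering punctuation and simple endings. The helper names (build_pattern, filter_lines, filter_file), the handled endings (s, 's, ’s), the UTF-8 encoding and the extra sample lines are all my own assumptions, so adjust them to your corpus.

```python
import re


def build_pattern(words):
    """Compile one regex that matches any of `words` as a whole word,
    optionally followed by a plural or possessive ending.
    Assumes `words` is non-empty (an empty list would match every word boundary)."""
    alternatives = "|".join(re.escape(w) for w in words)
    # \b stops "book" from matching inside "notebook" or "booking";
    # the optional group covers "books", "book's" and "book’s".
    # Surrounding punctuation (quotes, ?, !, ;) is non-word text, so \b handles it.
    return re.compile(rf"\b(?:{alternatives})(?:s|['’]s)?\b", re.IGNORECASE)


def filter_lines(lines, words):
    """Yield only the lines that contain none of the words."""
    pattern = build_pattern(words)
    for line in lines:
        if not pattern.search(line):
            yield line


def filter_file(in_path, out_path, words, encoding="utf-8"):
    """Stream in_path to out_path line by line, skipping matching lines."""
    with open(in_path, encoding=encoding) as src, \
         open(out_path, "w", encoding=encoding) as dst:
        dst.writelines(filter_lines(src, words))


sample = [
    "I want my book.",
    "I need my books.",
    "Why you need a book?",
    "Let's go read.",
    "Coming to library",
    "“Yay”; she said.",        # hypothetical extra line
    "My notebook is here.",    # hypothetical extra line
]
kept = list(filter_lines(sample, ["book", "yay"]))
print(kept)
```

For the real files you would call something like filter_file("Sample.txt", "Fixed.txt", ["book"]). The file is read one line at a time and the pattern is compiled once, so 60,000 lines should not be a problem. Irregular inflections (for example "mice" for "mouse") are not covered by this pattern and would need to be listed explicitly.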

Solution 1
2 · accepted · 2017-04-04 16:45:21
