
How to remove a set of words and their variants (or inflections) from a text file?

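I am trying to use Python to remove lines from a text file that contain certain words and their variants (I hope "variants" is the right word).

By variants I mean: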

"Yay","yay",'“Yay','Yay”',"Yay;","Yay?","Yay’s","Yay's",'Yay!','Yay.',"Yay”;"

So I tried doing it manually with the following code:

word_list = ["Yay","yay",'“Yay','Yay”',"Yay;","Yay?","Yay’s","Yay's",'Yay!','Yay.',"Yay”;"]

# Copy every line that contains none of the listed tokens.
with open("file1.txt", "r") as infile1, open("file2.txt", "w") as outfile1:
    for line in infile1:
        tempList = line.split()  # whitespace-separated tokens, punctuation still attached
        if not any(el in tempList for el in word_list):
            outfile1.write(line)

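It did not work well: some of the words listed in word_list are still present in the output file. There are also many more word variants to consider (for example God, God!, and the various forms of book).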
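One likely reason some lines slip through: line.split() splits only on whitespace, so a token such as "Yay," or "(Yay)" keeps its punctuation and never equals any entry in word_list. Below is a minimal sketch of normalizing each token before comparing. It assumes that stripping surrounding punctuation and a trailing possessive is acceptable for this corpus; the helper name normalize and the sample lines are made up for illustration.

```python
import string

# Punctuation to strip from both ends of a token; curly quotes are added
# because string.punctuation only covers ASCII characters.
STRIP_CHARS = string.punctuation + "“”‘’"

def normalize(token):
    """Lowercase a token, strip surrounding punctuation and a trailing 's."""
    token = token.strip(STRIP_CHARS).lower()
    for suffix in ("'s", "’s"):
        if token.endswith(suffix):
            token = token[: -len(suffix)]
    return token

banned = {"yay"}  # base forms only, lowercase (hypothetical list)

lines = ["“Yay”; she said.", "Yay's over.", "(Yay), fine.", "Nothing here."]
kept = [ln for ln in lines if not any(normalize(t) in banned for t in ln.split())]
print(kept)
```

With this approach only the base forms need to be listed, rather than every punctuation variant.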

I was wondering whether there is a more efficient way to do this (perhaps using regular expressions!).

Edit 1:

Input: Sample.txt:

I want my book.

I need my books.

Why you need a book?

Let's go read.

Coming to library

I need to remove all lines containing "book.", "books.", or "book?" from my Sample.txt file.

Output: Fixed.txt:

Let's go read.

Coming to library

Note: the original corpus has about 60,000 lines.

You can set a flag for each line and emit the line depending on the flag's value, like this:

input_sample = [
    "I want my book.",
    "I need my books.",
    "Why you need a book?",
    "Let's go read.",
    "Coming to library"
]
words = ['book']
result = []
for line in input_sample:
    flag = 0    # set to 1 as soon as any word matches this line
    for word in words:
        if word.lower() in line.lower():    # lowercase both sides so matching is case-insensitive
            flag = 1
            break    # no need to check the remaining words for this line
    if flag == 0:
        result.append(line)    # keep the line only when no word matched

print(result)

The result is:

["Let's go read.", 'Coming to library']
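Note that the substring test above also matches book inside longer words such as Facebook. If that is not wanted, a regular expression with word boundaries is one alternative. The following is only a sketch: it assumes that an inflection means an optional plural s/es or possessive 's suffix, and the names build_pattern and filter_lines are illustrative, not part of the answer above.

```python
import re

def build_pattern(words):
    """Compile one case-insensitive pattern that matches any of the words as a
    whole word, optionally followed by a plural or possessive suffix."""
    alternatives = "|".join(re.escape(w) for w in words)
    return re.compile(rf"\b(?:{alternatives})(?:es|s|['’]s)?\b", re.IGNORECASE)

def filter_lines(lines, words):
    """Yield only the lines that contain none of the words or their variants."""
    pattern = build_pattern(words)
    for line in lines:
        if not pattern.search(line):
            yield line

sample = [
    "I want my book.",
    "I need my books.",
    "Why you need a book?",
    "Let's go read.",
    "Coming to library",
    "I use Facebook daily.",
]
print(list(filter_lines(sample, ["book"])))
```

Because filter_lines accepts any iterable of lines, an open file object can be passed to it directly, so a corpus of about 60,000 lines is processed one line at a time instead of being loaded into a list first.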

