How to remove a set of words and their variants (or inflections) from a text file?
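I am trying to use Python to remove lines from a text file that contain certain words and their variants (I am not sure that is the right word).
By variants I mean: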
"Yay","yay",'“Yay','Yay”',"Yay;","Yay?","Yay’s","Yay's",'Yay!','Yay.',"Yay”;"
So I tried to handle them manually with the following code:
infile1 = open("file1.txt",'r')
outfile1 = open("file2.txt",'w')
word_list = ["Yay","yay",'“Yay','Yay”',"Yay;","Yay?","Yay’s","Yay's",'Yay!','Yay.',"Yay”;"]
for line in infile1:
    tempList = line.split()
    if any((el in tempList for el in word_list)):
        continue
    else:
        outfile1.write(line)
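It does not work well: some of the words listed in word_list still appear in the output file. There are also many more word variants to consider (for example God, God!, and the various forms of book).
I would like to know whether there is a more efficient way to do this (possibly with regular expressions!).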
Edit 1:
Input: Sample.txt:
I want my book.
I need my books.
Why you need a book?
Let's go read.
Coming to library
I need to remove all lines containing "book.", "books.", "book?" from my Sample.txt file.
Output: Fixed.txt:
Let's go read.
Coming to library
Note: the original corpus has about 60,000 lines.
You can set a flag for each line and emit the line depending on the flag value, like this:
input_sample = [
    "I want my book.",
    "I need my books.",
    "Why you need a book?",
    "Let's go read.",
    "Coming to library"
]
words = ['book']
result = []
for line in input_sample:
    flag = 0  # will be used to check if match is found or not
    for word in words:
        if word.lower() in line.lower():  # converting both words and lines to lowercase so case is not a factor in matching
            flag = 1  # flag value set to 1 on the first match
            break  # exits the inner for-loop, as no more words need to be checked and the next line can be checked
    if flag == 0:
        result.append(line)  # keep the line only when there is no match; on a match the flag would have been 1
print(result)
The result is:
["Let's go read.", 'Coming to library']
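Since the question mentions regular expressions, here is a minimal sketch of that route. Note that the substring test above would also drop lines containing words such as "notebook"; a word-boundary pattern avoids that while still ignoring surrounding punctuation and quotes. The helper names build_pattern, filter_lines and filter_file are made up for this illustration, and the optional suffix (a plural "s" or a possessive "'s") is only an assumption about which variants matter; extend it to suit your corpus.

```python
import re


def build_pattern(words):
    """Compile one case-insensitive regex that matches any of the given words."""
    alternatives = "|".join(re.escape(w) for w in words)
    # \b anchors the match at word boundaries, so quotes, '?', ';' or '.'
    # around the word do not matter, and "notebook" does not match "book".
    # The optional suffix also accepts "books", "book's" and "book’s" (assumed variants).
    return re.compile(r"\b(?:%s)(?:s|['’]s)?\b" % alternatives, re.IGNORECASE)


def filter_lines(lines, words):
    """Return only the lines that contain none of the words."""
    pattern = build_pattern(words)
    return [line for line in lines if not pattern.search(line)]


def filter_file(src, dst, words):
    """Stream src line by line and write the non-matching lines to dst."""
    pattern = build_pattern(words)
    with open(src, encoding="utf-8") as infile, \
         open(dst, "w", encoding="utf-8") as outfile:
        for line in infile:
            if not pattern.search(line):
                outfile.write(line)


input_sample = [
    "I want my book.",
    "I need my books.",
    "Why you need a book?",
    "Let's go read.",
    "Coming to library",
]
result = filter_lines(input_sample, ["book", "yay"])
print(result)
```

For the real corpus you would call something like filter_file("Sample.txt", "Fixed.txt", ["book", "yay"]). The pattern is compiled once and the file is read one line at a time, so about 60,000 lines should not be a problem.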