How to remove a set of words and their variants (or inflections) from a text file?
I am trying to remove lines from a text file that contain certain words and their variants (I hope that is the correct word) using Python.
What I mean by variants:
"Yay","yay",'“Yay','Yay”',"Yay;","Yay?","Yay’s","Yay's",'Yay!','Yay.',"Yay”;"
So, I tried doing it manually using the following code:
infile1 = open("file1.txt",'r')
outfile1 = open("file2.txt",'w')
word_list = ["Yay","yay",'“Yay','Yay”',"Yay;","Yay?","Yay’s","Yay's",'Yay!','Yay.',"Yay”;"]
for line in infile1:
    tempList = line.split()
    if any((el in tempList for el in word_list)):
        continue
    else:
        outfile1.write(line)
It didn't work out well: some of the words mentioned in word_list were still present in the output file. There are many more word variants to consider (like God, God!, book, Book, books, books? etc.).
I was wondering if there is a way to do this more efficiently (maybe with a regular expression).
EDIT 1:
Input: Sample.txt:
I want my book.
I need my books.
Why you need a book?
Let's go read.
Coming to library
I need to remove all the lines containing "book.", "books.", "book?" from my sample.txt file.
Output: Fixed.txt:
Let's go read.
Coming to library
NOTE: The original corpus has around 60,000 lines.
You can set a flag for every line and emit the line based on the flag value, something like this:
input_sample = [
    "I want my book.",
    "I need my books.",
    "Why you need a book?",
    "Let's go read.",
    "Coming to library"
]
words = ['book']
result = []
for line in input_sample:
    flag = 0  # will be used to check if a match is found or not
    for word in words:
        if word.lower() in line.lower():  # lowercase both the word and the line so case is not a factor in matching
            flag = 1  # flag is set to 1 on the first match
            break  # exit the inner loop: no more words need to be checked for this line
    if flag == 0:
        result.append(line)  # keep the line only when no word matched (otherwise flag would be 1)
print(result)
This results in:
["Let's go read.", 'Coming to library']
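Since the question asks about regular expressions, here is a minimal sketch of that route. It is an alternative to the flag loop above, not part of the original answer. The helper names build_pattern and filter_lines are made up for illustration, and it assumes the only variants that matter are case, surrounding punctuation, and a trailing "s", "'s" or "’s". Unlike the plain substring test, the word boundaries keep "book" from matching inside a longer word such as "notebook".

```python
import re

def build_pattern(words):
    # Hypothetical helper: one case-insensitive pattern for all base words.
    # \b anchors the match at word boundaries, so punctuation and quotes
    # around the word do not matter; (?:['’]?s)? allows "books", "Yay's", "Yay’s".
    alternatives = "|".join(re.escape(w) for w in words)
    return re.compile(r"\b(?:" + alternatives + r")(?:['’]?s)?\b", re.IGNORECASE)

def filter_lines(lines, words):
    # Keep only the lines in which none of the words (or variants) occurs.
    pattern = build_pattern(words)
    return [line for line in lines if not pattern.search(line)]

sample = [
    "I want my book.",
    "I need my books.",
    "Why you need a book?",
    "Let's go read.",
    "Coming to library",
]
kept = filter_lines(sample, ["book", "yay"])
print(kept)
```

For the 60,000-line file, the same compiled pattern could be applied line by line while iterating over file1.txt and writing the non-matching lines to file2.txt, so the pattern is built only once.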