How to remove a set of words and their variants (or inflections) from a text file?

I am trying to use Python to remove lines from a text file that contain certain words and their variants (I am not sure that is the right word).

What I mean by variants:

"Yay","yay",'“Yay','Yay”',"Yay;","Yay?","Yay’s","Yay's",'Yay!','Yay.',"Yay”;"

So I tried doing it manually with the following code:

infile1 = open("file1.txt",'r')
outfile1 = open("file2.txt",'w')

word_list = ["Yay","yay",'“Yay','Yay”',"Yay;","Yay?","Yay’s","Yay's",'Yay!','Yay.',"Yay”;"]

for line in infile1:
    tempList = line.split()
    if any((el in tempList for el in word_list)):
        continue
    else:
        outfile1.write(line)

It didn't work well: some of the words listed in word_list were still present in the output file. There are also many more word variants to consider (such as God, God!, book, Book, books, books?).

I was wondering whether there is a more efficient way to do this (perhaps with regular expressions).

EDIT 1:

Input: Sample.txt:

I want my book.

I need my books.

Why you need a book?

Let's go read.

Coming to library

I need to remove all the lines containing "book.", "books.", "book?" from my sample.txt file.

Output: Fixed.txt:

Let's go read.

Coming to library

NOTE: The original corpus has around 60,000 lines.

You can set a flag for each line and keep or drop the line based on the flag's value, something like this:

input_sample = [
    "I want my book.",
    "I need my books.",
    "Why you need a book?",
    "Let's go read.",
    "Coming to library",
]
words = ['book']
result = []
for line in input_sample:
    flag = 0    # set to 1 once any word is found in this line
    for word in words:
        # lowercase both sides so case is not a factor; this is a substring test,
        # so 'book' also matches 'books', 'book.' and 'book?'
        if word.lower() in line.lower():
            flag = 1    # the first match is enough
            break    # no need to check the remaining words for this line
    if flag == 0:
        result.append(line)    # keep only lines in which no word matched

print(result)

This results in:

["Let's go read.", 'Coming to library']
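Since the question asks about regular expressions, here is a minimal sketch of that route. It is not part of the answer above. The substring test above also drops lines containing words such as "facebook" or "bookkeeper", and a word-boundary pattern avoids that. The sketch assumes the variants are limited to case changes, a plural or possessive ending (s, es, 's, ’s) and surrounding punctuation. The helper names build_pattern, filter_lines and filter_file are hypothetical, and the UTF-8 encoding is an assumption about the corpus.

```python
import re


def build_pattern(words):
    """Compile one case-insensitive pattern matching any word or its simple variants."""
    # Escape each word so any regex metacharacters in it are taken literally.
    alternatives = "|".join(re.escape(w) for w in words)
    # \b anchors on word boundaries, so "book" does not match inside "facebook".
    # The optional group covers an assumed set of endings: s, es, 's and ’s.
    return re.compile(r"\b(?:" + alternatives + r")(?:s|es|['’]s)?\b", re.IGNORECASE)


def filter_lines(lines, words):
    """Return only the lines that contain none of the words."""
    pattern = build_pattern(words)
    return [line for line in lines if not pattern.search(line)]


def filter_file(src_path, dst_path, words):
    """Stream src_path line by line and copy the non-matching lines to dst_path."""
    pattern = build_pattern(words)
    # UTF-8 is assumed here; adjust if the corpus uses another encoding.
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not pattern.search(line):
                dst.write(line)


sample = [
    "I want my book.",
    "I need my books.",
    "Why you need a book?",
    "Let's go read.",
    "Coming to library",
]
print(filter_lines(sample, ["book"]))
```

Punctuation such as ".", "?", ";" or curly quotes never needs to be listed, because those characters are not word characters and so always form a boundary. Irregular inflections (for example "mice" for "mouse") are not covered. For those, each form would have to be added to the word list, or a stemmer or lemmatizer used instead. For a file of about 60,000 lines, filter_file("file1.txt", "file2.txt", ["book", "yay"]) reads one line at a time and compiles the pattern only once.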
