
Delete stop words in a file in Python

I have a file of stop words (one per line) and another file (a corpus, actually) that contains many sentences, one per line. I need to delete the stop words from the corpus and return each line without them. I wrote some code, but it only returns one sentence. (The language is Persian.) How can I fix it so that it returns all of the sentences?

with open ("stopwords.txt", encoding = "utf-8") as f1:
   with open ("train.txt", encoding = "utf-8") as f2:
      for i in f1:
          for line in f2:
              if i in line:
                 line= line.replace(i, "")
with open ("NoStopWordsTrain.txt", "w", encoding = "utf-8") as f3:
   f3.write (line)

The problem is that your last two lines of code are not inside the for loop. You iterate through all of f2, line by line, and do nothing with the result. Then, after the loop has finished, you write only the last line to f3. Instead, try:

with open("stopwords.txt", encoding="utf-8") as stopfile:
    # read().split() drops the trailing newlines that readlines() would keep
    stopwords = stopfile.read().split()  # make it into a convenient list
    print(stopwords)  # just to check that this works
with open("train.txt", encoding="utf-8") as trainfile:
    with open("NoStopWordsTrain.txt", "w", encoding="utf-8") as newfile:
        for line in trainfile:  # go through each line
            for word in stopwords:  # go through and remove each stop word
                line = line.replace(word, "")
            newfile.write(line)  # write inside the loop so every line is saved
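
One caveat: str.replace works on substrings, not whole words, so a short stop word can also be cut out of the middle of a longer word. Below is a minimal sketch of a token-based alternative, assuming words are separated by whitespace. The function name remove_stopwords and the sample data are made up for illustration and are not part of the original answer.

```python
def remove_stopwords(line, stopwords):
    """Return `line` without the whitespace-separated tokens found in `stopwords`."""
    kept = [token for token in line.split() if token not in stopwords]
    return " ".join(kept)


# Hypothetical data standing in for stopwords.txt and train.txt.
stopwords = {"the", "a", "in"}
lines = ["the cat sat in a tin", "a theory in the making"]

# "tin" and "theory" survive intact, although they contain "in" and "the".
cleaned = [remove_stopwords(line, stopwords) for line in lines]
print(cleaned)
```

Note that split() also discards the trailing newline and collapses repeated spaces, so a caller writing to a file would need to append "\n" to each cleaned line. Punctuation attached to a word is not handled here.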

You can iterate through both files and write to a third one. @Noam was right that the problem was the indentation of your last file open.

with open("stopwords.txt", encoding="utf-8") as sw, open("train.txt", encoding="utf-8") as train, open("NoStopWordsTrain.txt", "w", encoding="utf-8") as no_sw:
    stopwords = sw.readlines()
    # each line read from train already ends with "\n", so none is added here
    no_sw.writelines(line for line in train if line not in stopwords)

This writes every line of train to the new file, skipping any line that is exactly one of the stop words. Note that it filters whole lines: a stop word inside a longer sentence is left in place.
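
To make the line-level behaviour concrete, here is a small in-memory sketch that uses io.StringIO in place of the real files. The sample contents are invented for illustration.

```python
import io

# In-memory stand-ins for stopwords.txt and train.txt (invented contents).
sw = io.StringIO("and\nthe\n")
train = io.StringIO("the\nthe cat and the dog\nand\n")

stopwords = set(sw.readlines())  # {"and\n", "the\n"}: the newlines are kept
kept = [line for line in train if line not in stopwords]
print(kept)
```

Only the lines that consist of a single stop word are dropped; the sentence in the middle is written out unchanged, stop words included.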

If you think the with open(... line is too long, you can use partial from Python's functools module to preset the encoding argument.

from functools import partial
utfopen = partial(open, encoding="utf-8")

with utfopen("stopwords.txt") as sw, utfopen("train.txt") as train, utfopen("NoStopWordsTrain.txt", "w") as no_sw:
    ...  # rest of your code here
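
For completeness, here is a rough end-to-end sketch of how the utfopen helper might be used, assuming whitespace-separated words and word-level filtering. The function name strip_stopwords and the Persian sample sentences are hypothetical, and the demo runs on throwaway files in a temporary directory rather than on the real corpus.

```python
import tempfile
from functools import partial
from pathlib import Path

utfopen = partial(open, encoding="utf-8")


def strip_stopwords(stop_path, train_path, out_path):
    """Copy train_path to out_path, dropping tokens listed in stop_path."""
    with utfopen(stop_path) as sw:
        stopwords = set(sw.read().split())
    with utfopen(train_path) as train, utfopen(out_path, "w") as out:
        for line in train:
            kept = [word for word in line.split() if word not in stopwords]
            out.write(" ".join(kept) + "\n")


# Demo on temporary files with invented contents.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "stopwords.txt").write_text("و\nدر\n", encoding="utf-8")
    (tmp_dir / "train.txt").write_text("کتاب و قلم\nمن در خانه هستم\n", encoding="utf-8")
    strip_stopwords(
        tmp_dir / "stopwords.txt",
        tmp_dir / "train.txt",
        tmp_dir / "NoStopWordsTrain.txt",
    )
    result = (tmp_dir / "NoStopWordsTrain.txt").read_text(encoding="utf-8")

print(result)
```

Because partial only presets encoding, the mode argument ("w") can still be passed positionally as with the plain open call.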
