Delete stop words in a file in Python

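I have one file that contains stop words (each on its own line) and another file (actually a corpus) that contains many sentences, each on its own line. I need to remove the stop words from the corpus and return every line without them. I wrote some code, but it only returns one sentence. (The language is Persian.) How can I fix it so that it returns all the sentences?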

with open ("stopwords.txt", encoding = "utf-8") as f1:
   with open ("train.txt", encoding = "utf-8") as f2:
      for i in f1:
          for line in f2:
              if i in line:
                 line= line.replace(i, "")
with open ("NoStopWordsTrain.txt", "w", encoding = "utf-8") as f3:
   f3.write (line)

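The problem is that your last two lines of code are not inside the for loop. You iterate over all of f2, line by line, and do nothing with the result. Then, once the loop has finished, only the last line is written to f3. Instead, try: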

with open("stopwords.txt", encoding="utf-8") as stopfile:
    # read the stop words into a list, dropping the trailing newline of each
    stopwords = stopfile.read().splitlines()
    print(stopwords)  # just to check that this works
with open("train.txt", encoding="utf-8") as trainfile:
    with open("NoStopWordsTrain.txt", "w", encoding="utf-8") as newfile:
        for line in trainfile:  # go through each line
            for word in stopwords:  # go through and replace each word
                line = line.replace(word, "")
            newfile.write(line)
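
A side note on why the nested loops in the question misbehave even before the write: a file object is an iterator that can be consumed only once, so the inner `for line in f2` runs to completion for the first stop word and yields nothing for the rest. The sketch below illustrates this with `io.StringIO` objects standing in for the two files; the sample contents are made up for the demonstration.

```python
import io

# In-memory stand-ins for stopwords.txt and train.txt (contents are made up).
stop_file = io.StringIO("the\nof\n")
train_file = io.StringIO("the cat\nking of hearts\n")

visits = []
for stop in stop_file:
    # train_file is consumed entirely during the first outer iteration,
    # so this inner loop does nothing for every later stop word.
    for line in train_file:
        visits.append((stop.strip(), line.strip()))

# Only the first stop word was ever paired with any line.
print(visits)
```

This is why reading the stop words into a list first and looping over the corpus in the outer loop, as in the code above, is the safer arrangement.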

You can iterate over both files and then write to a third. @Noam is right that you had an indentation problem where you open the last file.

with open("stopwords.txt", encoding="utf-8") as sw, open("train.txt", encoding="utf-8") as train, open("NoStopWordsTrain.txt", "w", encoding="utf-8") as no_sw:
    stopwords = sw.readlines()
    # lines already end with "\n", so write them back unchanged
    no_sw.writelines(line for line in train if line not in stopwords)

This basically just writes all the lines in train, filtering out any line that is itself one of the stop words.
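
Since the snippet above drops whole lines, while the question asks for stop words to be removed from inside each sentence, a token-based variant may be closer to what is wanted. The following is only a sketch: `remove_stopwords` and `filter_corpus` are hypothetical helper names, and splitting on whitespace is an assumption that may be too crude for real Persian text (punctuation attached to a word is not handled).

```python
def remove_stopwords(line, stopwords):
    """Return `line` without the whitespace-separated tokens found in `stopwords`."""
    return " ".join(tok for tok in line.split() if tok not in stopwords)


def filter_corpus(stop_path, in_path, out_path):
    """Write each line of `in_path` to `out_path` with stop words removed."""
    with open(stop_path, encoding="utf-8") as f:
        # A set gives fast membership tests; blank lines are skipped.
        stopwords = {w.strip() for w in f if w.strip()}
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(remove_stopwords(line, stopwords) + "\n")
```

Calling `filter_corpus("stopwords.txt", "train.txt", "NoStopWordsTrain.txt")` would process the files named in the question. Matching whole tokens, rather than using `str.replace`, avoids deleting a stop word that happens to occur inside a longer word.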

If you find the with open(... line too long, you can use Python's functools.partial to set default arguments.

from functools import partial
utfopen = partial(open, encoding="utf-8")

with utfopen("stopwords.txt") as sw, utfopen("train.txt") as train, utfopen("NoStopWordsTrain.txt", "w") as no_sw:
    ...  # rest of your code here
