Removing stop words from a file in Python

Question

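I have a file that contains stop words (one per line) and another file (actually a corpus) that contains many sentences, one per line. I need to remove the stop words from the corpus and return every line without them. I wrote some code, but it only returns one sentence. (The language is Persian.) How can I fix it so that it returns all the sentences?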

with open ("stopwords.txt", encoding = "utf-8") as f1:
   with open ("train.txt", encoding = "utf-8") as f2:
      for i in f1:
          for line in f2:
              if i in line:
                 line= line.replace(i, "")
with open ("NoStopWordsTrain.txt", "w", encoding = "utf-8") as f3:
   f3.write (line)

Answer 1

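The problem is that your last two lines of code are not inside the for loop. You iterate over all of f2 line by line and do nothing with the result. Then, after the loop has finished, you write only the last line to f3. Instead, try: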

with open("stopwords.txt", encoding="utf-8") as stopfile:
    # strip the trailing newline from each entry and skip blank lines
    stopwords = [word.strip() for word in stopfile if word.strip()]
    print(stopwords)  # just to check that the list looks right
with open("train.txt", encoding="utf-8") as trainfile:
    with open("NoStopWordsTrain.txt", "w", encoding="utf-8") as newfile:
        for line in trainfile:  # go through each line
            for word in stopwords:  # remove every occurrence of each stop word
                line = line.replace(word, "")
            newfile.write(line)
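
One caveat with str.replace is that it removes substrings, so a stop word that happens to occur inside a longer word is cut out of that word too. A minimal sketch of a token-based alternative follows, assuming that words are separated by whitespace (which may not hold for every Persian text). The helper names load_stopwords and remove_stopwords are hypothetical, not part of the original answer.

```python
def load_stopwords(lines):
    """Build a set of stop words from an iterable of lines, skipping blanks."""
    return {line.strip() for line in lines if line.strip()}


def remove_stopwords(line, stopwords):
    """Drop whitespace-separated tokens that exactly match a stop word."""
    kept = [token for token in line.split() if token not in stopwords]
    return " ".join(kept)


# Small in-memory example; in practice the lines would come from the files.
stopwords = load_stopwords(["the\n", "a\n", "\n", "is\n"])
print(remove_stopwords("the cat is a theory\n", stopwords))  # cat theory
```

Here "theory" survives even though it contains "the", because only whole tokens are compared. Each cleaned line would then be written with newfile.write(cleaned + "\n").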

Answer 2

You can iterate over both input files and write to a third one. @Noam is right that you had an indentation problem where you open the last file.

with open("stopwords.txt", encoding="utf-8") as sw, open("train.txt", encoding="utf-8") as train, open("NoStopWordsTrain.txt", "w", encoding="utf-8") as no_sw:
    stopwords = sw.readlines()
    # each line already ends with "\n", so write it back unchanged
    no_sw.writelines(line for line in train if line not in stopwords)

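This simply writes every line of train and drops any line that is itself one of the stop words. Note that it compares whole lines, so it does not remove stop words that appear inside a longer sentence.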
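
Because readlines() keeps the trailing newline on every entry, a whole-line comparison like this can miss the last line of a file that does not end with a newline. The sketch below illustrates this with io.StringIO objects standing in for the real files (the sample contents are made up), and strips the newline on both sides before comparing.

```python
import io

# In-memory stand-ins for stopwords.txt and train.txt (made-up contents).
stop_file = io.StringIO("the\nis\n")
train_file = io.StringIO("the\ncat\nis")  # last line has no trailing newline

raw = io.StringIO("the\nis\n").readlines()
print(raw)          # ['the\n', 'is\n']
print("is" in raw)  # False: "is" differs from "is\n"

# Strip the newline on both sides so the comparison is on the words themselves.
stopwords = {word.rstrip("\n") for word in stop_file}
kept = [line.rstrip("\n") for line in train_file
        if line.rstrip("\n") not in stopwords]
print(kept)         # ['cat']
```

The kept lines can then be written back with an explicit "\n" appended to each one.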

If the with open(... line feels too long, you can use Python's functools.partial to set default arguments.

from functools import partial
utfopen = partial(open, encoding="utf-8")

with utfopen("stopwords.txt") as sw, utfopen("train.txt") as train, utfopen("NoStopWordsTrain.txt", "w") as no_sw:
    pass  # rest of your code goes here

Answer 1
0 accepted 2016-08-08 18:28:47

Answer 2
0 2016-08-08 18:39:21
