Delete stop words in a file in Python

I have a file of stop words (one per line) and another file (a corpus, actually) containing many sentences, each on its own line. I need to remove the stop words from the corpus and write out each line without them. I wrote the code below, but it only returns one sentence. (The language is Persian.) How can I fix it so that it returns all of the sentences?

with open ("stopwords.txt", encoding = "utf-8") as f1:
   with open ("train.txt", encoding = "utf-8") as f2:
      for i in f1:
          for line in f2:
              if i in line:
                 line= line.replace(i, "")
with open ("NoStopWordsTrain.txt", "w", encoding = "utf-8") as f3:
   f3.write (line)

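The problem is that your last two lines of code are not inside the for loop. You iterate through all of f2 line by line but never store or write the modified lines, so after the loop ends you write only the last line to f3. Note also that f2 is exhausted after the first stop word, so the inner loop does nothing for the remaining ones. Instead, try: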

with open("stopwords.txt", encoding="utf-8") as stopfile:
    stopwords = stopfile.read().splitlines()  # list of stop words, without trailing newlines
    print(stopwords)  # just to check that this works
with open("train.txt", encoding="utf-8") as trainfile:
    with open("NoStopWordsTrain.txt", "w", encoding="utf-8") as newfile:
        for line in trainfile:  # go through each line
            for word in stopwords:  # go through and replace each word
                line = line.replace(word, "")
            newfile.write(line)  # write the cleaned line while still inside the loop
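
One caveat with str.replace: it removes the stop word wherever it occurs as a substring, even inside a longer word. Below is a minimal sketch of a token-based alternative, assuming words are separated by whitespace. The helper name remove_stopwords is made up for illustration and is not part of the answer above.

```python
def remove_stopwords(line, stopwords):
    """Return `line` without the whitespace-separated tokens found in `stopwords`."""
    kept = [token for token in line.split() if token not in stopwords]
    return " ".join(kept)


stopwords = {"the", "a", "is"}

# Substring replacement also damages words that merely contain a stop word.
damaged = "this is a theme".replace("the", "")

# Token comparison only drops exact matches.
cleaned = remove_stopwords("this is a theme", stopwords)
```

Note that joining with a single space collapses any run of whitespace in the original line, and the trailing newline has to be added back when writing to a file.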

You can just iterate through both files and write to the third one. @Noam was right that the problem was the indentation of your last file open.

with open("stopwords.txt", encoding="utf-8") as sw, open("train.txt", encoding="utf-8") as train, open("NoStopWordsTrain.txt", "w", encoding="utf-8") as no_sw:
    stopwords = set(sw.read().split())  # stop words without trailing newlines
    # keep only the words of each line that are not stop words
    no_sw.writelines(" ".join(w for w in line.split() if w not in stopwords) + "\n" for line in train)

This writes every line of train, keeping only the words that are not in stopwords.

If you think the with open(...) line is too long, you can use functools.partial to preset the encoding argument.

from functools import partial
utfopen = partial(open, encoding="utf-8")

with utfopen("stopwords.txt") as sw, utfopen("train.txt") as train, utfopen("NoStopWordsTrain.txt", "w") as no_sw:
    pass  # rest of your code here
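
To show how the pieces might fit together, here is a sketch that combines the partial-based opener with word-level filtering. The function name filter_file, the temporary directory, and the sample sentences are assumptions made for the demo. They are not from the original answers.

```python
import os
import tempfile
from functools import partial

# open() with the encoding preset, as in the snippet above
utfopen = partial(open, encoding="utf-8")


def filter_file(stop_path, train_path, out_path):
    """Copy train_path to out_path, dropping whitespace-separated stop words."""
    with utfopen(stop_path) as sw, utfopen(train_path) as train, utfopen(out_path, "w") as no_sw:
        stopwords = set(sw.read().split())
        for line in train:
            kept = [w for w in line.split() if w not in stopwords]
            no_sw.write(" ".join(kept) + "\n")


# Demo on throwaway files so that nothing in the working directory is touched.
with tempfile.TemporaryDirectory() as tmp:
    stop_path = os.path.join(tmp, "stopwords.txt")
    train_path = os.path.join(tmp, "train.txt")
    out_path = os.path.join(tmp, "NoStopWordsTrain.txt")

    with utfopen(stop_path, "w") as f:
        f.write("the\nis\n")
    with utfopen(train_path, "w") as f:
        f.write("the sky is blue\ngrass is green\n")

    filter_file(stop_path, train_path, out_path)

    with utfopen(out_path) as f:
        result = f.read()
```

Positional arguments such as "w" are still passed through to open, so utfopen works for both reading and writing.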
