
Loop through files removing stop-words

I want to remove stop-words from multiple files in a local folder. I know how to do it for one file, but I can't get my head around doing it for all files in that folder.

What I embarrassingly tried:

import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import glob
import os
import codecs


stop_words = set(stopwords.words('english'))

for afile in glob.glob("*.txt"):
    file1 = open(afile)
    line = file1.read()
    words = word_tokenize(line)
    words_without_stop_words = ["" if word in stop_words else word for word in words]
    new_words = " ".join(words_without_stop_words).strip()
    appendFile = open('subfolder/file1.txt','w')
    appendFile.write(new_words)
    appendFile.close()

I don't even know how far I could get with this, because I get:

Traceback (most recent call last):
  File "C:\Desktop\neg\sw.py", line 14, in <module>
    line = file1.read()
  File "C:\Program Files\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1757: character maps to <undefined>

I tried using glob, but I can't find good documentation for it. Maybe it isn't necessary?

It seems the encoding is wrong for your file. You will need to call the open() function with the proper encoding kwarg (it might be "utf-8"). And use 'a' when you want to append to the file. I would actually open the output file once before working through the input files and close it after all of them have been written (see the sketch after the next snippet).

When filtering your words against the stop words, do not put empty strings into the list; just omit those words:

words_without_stop_words = [word for word in words if word not in stop_words]
new_words = " ".join(words_without_stop_words).strip()
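Putting both suggestions together, a minimal sketch could look like the following (the output path subfolder/output.txt and the "utf-8" encoding are assumptions; adjust them to your data):

import glob
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

# Open the output file once, in append mode, before the loop;
# 'subfolder/output.txt' and the utf-8 encoding are assumptions.
with open('subfolder/output.txt', 'a', encoding='utf-8') as out_file:
    for afile in glob.glob("*.txt"):
        with open(afile, encoding='utf-8') as in_file:
            words = word_tokenize(in_file.read())
        # Omit stop words instead of inserting empty strings
        filtered = [word for word in words if word not in stop_words]
        out_file.write(" ".join(filtered).strip() + "\n")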

You also have to specify the encoding when writing to the file, which is normally utf-8. You can do this using:

appendFile = open('subfolder/file1.txt','w', encoding='utf-8')
appendFile.write(new_words)
appendFile.close()

Instead of writing data to the file (which overwrites it on every iteration), you should append the data so that all of it is stored in a single file.

You can also use codecs to open the file for writing, like:

f = codecs.open(filename, "a", encoding="utf-8")

and then write the data to it.
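For instance, a short sketch of that variant (the file name subfolder/output.txt is only an illustration; 'a' makes it append rather than overwrite on every loop iteration):

import codecs

# Hypothetical output name; open once in append mode so every input
# file's cleaned text is collected in the same place.
f = codecs.open('subfolder/output.txt', 'a', encoding="utf-8")
f.write("cleaned text for one input file\n")
f.close()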

From the full stack trace, you are using a Windows system with a Western European language and the default ANSI code page 1252.

One of your files contains a 0x9d byte. At read time, Python tries to decode the file bytes to unicode strings and fails because 0x9d is not a valid CP1252 byte, hence the error.
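You can reproduce the failure in isolation (a one-off illustration, not part of the original script):

# 0x9d has no character assigned in code page 1252, so decoding it fails.
b'\x9d'.decode('cp1252')   # UnicodeDecodeError: character maps to <undefined>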

What can be done?

The correct way is to find the offending file and try to determine its real encoding. A simple way would be to display its name:

for afile in glob.glob("*.txt"):
    with open(afile) as file1:
        try:
            line = file1.read()
        except UnicodeDecodeError as e:
            print("Wrong encoding file", afile, e)       # display file name and error
            continue                                     # skip to next file
    ...

Alternatively, if the error only happens in a few words of a few files, you could simply ignore or replace the offending bytes:

for afile in glob.glob("*.txt"):
    with open(afile, errors="replace") as file1:
        line = file1.read()
    ...
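With errors="replace", each undecodable byte is turned into the Unicode replacement character U+FFFD instead of raising an exception (errors="ignore" would drop it entirely). A quick illustration:

# Undecodable bytes become '\ufffd' instead of raising an error.
b'caf\x9d'.decode('cp1252', errors='replace')   # -> 'caf\ufffd'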
