Looping through files to remove stop-words

Question

I want to remove stop-words from multiple files in a local folder. I know how to do it for one file, but I can't get my head around doing it for all files in that folder.

What I embarrassingly tried:

import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import glob
import os
import codecs


stop_words = set(stopwords.words('english'))

for afile in glob.glob("*.txt"):
    file1 = open(afile)
    line = file1.read()
    words = word_tokenize(line)
    words_without_stop_words = ["" if word in stop_words else word for word in words]
    new_words = " ".join(words_without_stop_words).strip()
    appendFile = open('subfolder/file1.txt','w')
    appendFile.write(new_words)
    appendFile.close()

I don't even know how far I could get with this, because I get:

Traceback (most recent call last):
  File "C:\Desktop\neg\sw.py", line 14, in <module>
    line = file1.read()
  File "C:\Program Files\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1757: character maps to <undefined>

I tried using glob, but I can't find good documentation for it. Maybe it is not necessary?

Answer 1

It seems the encoding is wrong for your file. You will need to call the open() function with the proper encoding kwarg (it might be "utf-8"). Also use mode 'a' when you want to append to your output file. I would actually open the output file before looping over the input files and close it after all files have been written.

When filtering stop-words out of your words, do not put empty strings into the list; just omit those words:

words_without_stop_words = [word for word in words if word not in stop_words]
new_words = " ".join(words_without_stop_words).strip()
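
Putting the advice above together, a minimal sketch might look like this. The function names strip_stop_words and combine_without_stop_words are hypothetical, "utf-8" is only an assumed encoding, and the tokenizer defaults to a plain whitespace split so the sketch runs without NLTK data. Passing word_tokenize and set(stopwords.words('english')) from the question should slot in the same way.

```python
import glob
import os


def strip_stop_words(text, stop_words, tokenize=str.split):
    """Return text with every token found in stop_words left out."""
    kept = [word for word in tokenize(text) if word not in stop_words]
    return " ".join(kept)


def combine_without_stop_words(pattern, out_path, stop_words,
                               tokenize=str.split, encoding="utf-8"):
    """Append the filtered text of every file matching pattern to out_path."""
    out_abs = os.path.abspath(out_path)
    # Never read the output file itself, even if it matches the pattern.
    paths = [p for p in sorted(glob.glob(pattern))
             if os.path.abspath(p) != out_abs]
    # Open the output once, in append mode, before looping over the inputs.
    with open(out_path, "a", encoding=encoding) as out:
        for path in paths:
            with open(path, encoding=encoding) as src:
                filtered = strip_stop_words(src.read(), stop_words, tokenize)
            out.write(filtered + "\n")
```

A call such as combine_without_stop_words("*.txt", "subfolder/file1.txt", stop_words) would then collect every filtered file into one output file, assuming subfolder already exists.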

Answer 2

You have to specify the encoding when writing to the file, which is normally utf-8. You can do this using:

appendFile = open('subfolder/file1.txt','w', encoding='utf-8')
appendFile.write(new_words)
appendFile.close()

Instead of opening the file in write mode, open it in append mode, so that all the data ends up in a single file.

You can also use codecs to open the file, like:

f = codecs.open(filename, "a", encoding="utf-8")  # "a" appends; the default mode is "r"

and then write the data.
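
To see why the mode matters inside a loop, here is a small sketch comparing 'w' and 'a'. The helper name write_all and the temporary file names are made up for illustration.

```python
import os
import tempfile


def write_all(path, chunks, mode):
    """Reopen path with the given mode for each chunk, then return its content."""
    for chunk in chunks:
        with open(path, mode, encoding="utf-8") as f:
            f.write(chunk + "\n")
    with open(path, encoding="utf-8") as f:
        return f.read()


folder = tempfile.mkdtemp()
# "w" truncates the file on every open, so only the last chunk survives.
overwritten = write_all(os.path.join(folder, "w.txt"), ["first", "second"], "w")
# "a" keeps the existing content and adds to the end.
appended = write_all(os.path.join(folder, "a.txt"), ["first", "second"], "a")

print(repr(overwritten))  # 'second\n'
print(repr(appended))     # 'first\nsecond\n'
```

This is the same effect as in the question's loop, where every pass reopens 'subfolder/file1.txt' with 'w' and discards what the previous file wrote.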

Answer 3

Judging from the full stack trace, you are using a Windows system with a Western European language and the default ANSI code page 1252.

One of your files contains a 0x9d byte. At read time, Python tries to decode the file's bytes into a unicode string and fails because 0x9d is not a valid CP1252 byte, hence the error.

What can be done?

The correct way is to identify the offending file and then try to work out its real encoding. A simple way to find the file is to display its name:

for afile in glob.glob("*.txt"):
    with open(afile) as file1:
        try:
            line = file1.read()
        except UnicodeDecodeError as e:
            print("Wrong encoding file", afile, e)       # display file name and error
            continue                                     # skip to next file
    ...
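
Once the offending file is known, one way to look for its real encoding is to try a few candidates in order. This is only a sketch: the helper name read_with_fallback and the candidate list are assumptions, and a decode that succeeds does not prove the guess is right. utf-8 is a plausible first candidate because 0x9d is, for example, the last byte of a curly closing quote (U+201D) in UTF-8.

```python
def read_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes the file."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the file, try the next one
    raise ValueError("none of %r could decode %s" % (encodings, path))
```

Strict candidates such as utf-8 should come first, since permissive single-byte encodings accept almost any input and would mask a better match.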

Alternatively, if the error only affects a few words in a few files, you could simply ignore or replace the offending bytes:

for afile in glob.glob("*.txt"):
    with open(afile, errors = "replace") as file1:
        line = file1.read()
    ...
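
The effect of the errors argument can be checked on a byte string directly, without any file. In this sketch the sample bytes are invented, and cp1252 is named explicitly so the result does not depend on the platform default.

```python
raw = b"caf\x9d ok"  # 0x9d has no character assigned in cp1252

try:
    raw.decode("cp1252")  # strict is the default error handler
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

replaced = raw.decode("cp1252", errors="replace")  # bad byte becomes U+FFFD
ignored = raw.decode("cp1252", errors="ignore")    # bad byte is dropped

print(strict_failed)   # True
print(repr(replaced))  # 'caf\ufffd ok'
print(repr(ignored))   # 'caf ok'
```

Either handler loses the original character, so this only suits cases where a few damaged words do not matter, such as before stop-word removal.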

3 answers

Answer 1: 1, 2018-08-17 09:57:13
Answer 2: 1, accepted, 2018-08-17 10:13:40
Answer 3: 1, 2018-08-17 10:16:15
