Reading French text files into items in a list - Python
I want to read a bunch of text documents in French and store the content of each document as an item in a list, so that I can calculate the tf-idf score later on (by counting words, etc.).
This is how I started my code. The point of it is to read each document's full text as a separate string:
import os, re
import glob
import operator
file_names = glob.glob(os.path.join("/Corpus", u'*'))
documents=["" for x in file_names]
files=["" for x in file_names]
for infile in (glob.glob(os.path.join("/Corpus", u'*'))):
    file = (open(infile,"r",encoding="utf-8"))
    data = file.read()
    print (data)
When I execute this, it prints some of the text, but then I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
I am obviously opening the file with an encoding of utf-8, so I don't understand what I'm doing wrong.
Also, I would appreciate any suggestions on how I could store the variable data, which contains all the text of a document, as an item of a list. The following solution didn't work:
documents.append(data)
Thank you
It seems that at least some of the files you are trying to read are not encoded in UTF-8. The best approach is to find out which encoding was used to save the files. If that is not possible, your best bet is to try a couple of encodings and see which one works (see https://docs.python.org/3/library/codecs.html#standard-encodings ).
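As a rough sketch of the "try a couple of encodings" idea: the function below reads the raw bytes once and returns the first successful decode. The name read_with_fallback and the candidate list are my own assumptions for illustration, not something from your code, and cp1252 / latin-1 are only guesses at what French text files are often saved in.

```python
from pathlib import Path

# Hypothetical candidate list (an assumption): adjust it to the encodings
# your files could plausibly have been saved in.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def read_with_fallback(path, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding) for the first encoding that decodes cleanly."""
    raw = Path(path).read_bytes()  # read the bytes once, decode several times
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this encoding does not fit, try the next one
    raise ValueError(f"none of {encodings} could decode {path}")
```

Keep in mind that latin-1 accepts any byte sequence, so a decode that succeeds is not proof that the encoding is right. Print a sample and check that the accented characters look correct.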
For your second question: documents.append(data) should work. The problem is that you pre-fill documents with one empty string per file, so append adds each text after those placeholders. Start from an empty list instead. This is all you need:
documents = []
for infile in file_names:
    ...
    documents.append(data)
Final tip: you're opening files but not closing them. The with statement can help you here: it closes the file automatically when its block ends, even if an error is raised while reading.
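Putting the empty list and the with statement together, a minimal sketch could look like this. The function name load_documents is hypothetical, and the default encoding="utf-8" is an assumption that you should replace with whatever encoding your files really use.

```python
import glob
import os


def load_documents(directory, encoding="utf-8"):
    """Read every regular file in `directory` and return the texts as a list."""
    documents = []  # start empty, then append one string per file
    # sorted() keeps the order of the documents stable between runs
    for path in sorted(glob.glob(os.path.join(directory, "*"))):
        if not os.path.isfile(path):
            continue  # skip sub-directories
        # the with block closes the file even if read() raises
        with open(path, "r", encoding=encoding) as infile:
            documents.append(infile.read())
    return documents
```

With your layout you would call it as documents = load_documents("/Corpus"), and documents[i] then holds the full text of the i-th file.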