
Reading French text files into items in a list - Python

I want to read a bunch of text documents in French and store the content of each document as an item in a list, in order to calculate the tf-idf score later on (by counting words, etc.).

This is how I started my code. The point of it is to read each document's full text separately, as a string:

import os, re
import glob
import operator

file_names = glob.glob(os.path.join("/Corpus", u'*'))
documents=["" for x in file_names]
files=["" for x in file_names]
for infile in (glob.glob(os.path.join("/Corpus", u'*'))):
    file = (open(infile,"r",encoding="utf-8"))
    data = file.read()
    print (data)

When I execute this, it prints some of the text, but then I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

I am obviously opening the file with an encoding of utf-8, so I don't understand what I'm doing wrong.

Also, I would appreciate any suggestions on how to store the variable data, which holds the full text of a document, as an item of a list. The following did not work:

documents.append(data)

Thank you

It seems that the files you are trying to read are not encoded in UTF-8. The best approach is to find out which encoding was used to save the files. If that is not possible, your best bet is to try a few encodings and see which one works (see https://docs.python.org/3/library/codecs.html#standard-encodings ).
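As a rough sketch of the "try a few encodings" idea: the helper below reads the raw bytes once and returns the first decoding that succeeds. The function name read_text_with_fallback and the candidate list are assumptions for illustration, not something from the question. cp1252 and latin-1 are only guesses for French text (byte 0xe2 is "â" in both), and latin-1 accepts every byte sequence, so it belongs last and a "successful" decode still needs a visual check.

```python
from pathlib import Path

# Candidate encodings are an assumption; adjust them to how the corpus was saved.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def read_text_with_fallback(path, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding) for the first encoding that decodes the whole file."""
    raw = Path(path).read_bytes()  # read bytes once, decode separately
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this encoding does not fit, try the next one
    raise ValueError(f"none of {encodings} could decode {path}")
```

Printing the returned encoding next to a short excerpt of each file makes it easier to spot a wrong guess, since accented characters will look garbled.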

For your second question: documents.append(data) should work. Your mistake is in how the list is initialized: it is pre-filled with one empty string per file, so appended items land after those placeholders. Start from an empty list instead. This is all you need:

documents = []
for infile in file_names:
    ...
    documents.append(data)

Final tip: you're opening files but never closing them. The with statement can help you here.
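A minimal sketch of how the with statement and the empty list might fit together. The function name load_documents and the choice to skip subdirectories are assumptions for illustration, and the encoding parameter should be whatever the files were actually saved in.

```python
import glob
import os


def load_documents(corpus_dir, encoding="utf-8"):
    """Read every file in corpus_dir and return a list with one string per document."""
    documents = []  # start empty so append() adds exactly one item per file
    for path in sorted(glob.glob(os.path.join(corpus_dir, "*"))):
        if not os.path.isfile(path):
            continue  # skip subdirectories (an assumption about the corpus layout)
        # The with block closes the file even if read() raises an error.
        with open(path, "r", encoding=encoding) as handle:
            documents.append(handle.read())
    return documents
```

With the layout from the question, this would be called as documents = load_documents("/Corpus"). Sorting the paths keeps the order of the list stable between runs.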
