I want to read a set of French text documents and store the content of each document as an item in a list, in order to calculate tf-idf scores later on (by counting words, etc.).
This is how I started my code. The point is to read each document's full text as a separate string:
import os, re
import glob
import operator
file_names = glob.glob(os.path.join("/Corpus", u'*'))
documents=["" for x in file_names]
files=["" for x in file_names]
for infile in glob.glob(os.path.join("/Corpus", u'*')):
    file = open(infile, "r", encoding="utf-8")
    data = file.read()
    print(data)
When I execute this, it prints some of the text, but then I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
I am clearly opening the file with the utf-8 encoding, so I don't understand what I'm doing wrong.
Also, I would appreciate any suggestions on how to store the variable data, which contains the full text of a document, as an item of a list. The following did not work:
documents.append(data)
Thank you
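It seems that at least some of the files you are trying to read are not encoded in UTF-8. The best approach is to find out which encoding was used to save them. If that is not possible, try a few likely encodings and see which one works (see https://docs.python.org/3/library/codecs.html#standard-encodings ). For French text, cp1252 (Windows-1252) and latin-1 (ISO-8859-1) are common candidates; in both of them the byte 0xe2 from your error message is the letter "â".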
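Here is a minimal sketch of that trial-and-error approach. The function name read_text_with_fallback and the list of candidate encodings are my own assumptions for illustration, not something from your code or from a library; adjust the candidates to whatever your corpus was likely saved with.

```python
from pathlib import Path

# Assumed candidates, tried in order. latin-1 can decode any byte sequence,
# so it must come last or it would hide the other candidates.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def read_text_with_fallback(path, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding) for the first encoding that decodes the whole file."""
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this encoding does not fit, try the next one
    raise ValueError(f"none of {encodings} could decode {path}")
```

Keep in mind that a decode that succeeds is not proof that the encoding is right, so it is worth printing a few accented words from the result and checking them by eye.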
For your second question: documents.append(data) should work. The problem is that you pre-fill documents with one empty string per file, so the appended texts end up after those placeholders. Start with an empty list instead. This is all you need:
documents = []
for infile in file_names:
    ...
    documents.append(data)
Final tip: you're opening files but never closing them. The with statement can help you here, because it closes the file automatically when the block ends.
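Putting the last two points together, the loop could look roughly like the sketch below. The function name load_documents is hypothetical, and it assumes every file in the directory uses the same encoding, which may not hold for your corpus.

```python
import glob
import os


def load_documents(corpus_dir, encoding="utf-8"):
    """Read every regular file in corpus_dir and return the texts as a list."""
    documents = []  # start empty; one item is appended per file
    # sorted() gives a stable order, so list positions map to file names
    for path in sorted(glob.glob(os.path.join(corpus_dir, "*"))):
        if not os.path.isfile(path):
            continue  # skip subdirectories
        # the with statement closes the file even if read() raises
        with open(path, "r", encoding=encoding) as handle:
            documents.append(handle.read())
    return documents
```

With your layout you would call it as documents = load_documents("/Corpus"), passing a different encoding once you know which one the files use.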