Reading French text files into items in a list - Python

I want to read a set of text documents in French and store the content of each document as an item in a list, so that I can calculate tf-idf scores later on (by counting words, etc.).

This is how I started my code. The point of it is to read each document's full text as a separate string:

import os, re
import glob
import operator

file_names = glob.glob(os.path.join("/Corpus", u'*'))
documents=["" for x in file_names]
files=["" for x in file_names]
for infile in (glob.glob(os.path.join("/Corpus", u'*'))):
    file = (open(infile,"r",encoding="utf-8"))
    data = file.read()
    print (data)

When I run this, it prints some of the text, but then I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

I am explicitly opening the file with encoding="utf-8", so I don't understand what I'm doing wrong.

Also, I would appreciate any suggestions on how to store the variable data, which contains all the text of one document, as an item of a list. The following didn't work:

documents.append(data)

Thank you

It seems that the files you are trying to read are not encoded in UTF-8. The best approach is to find out which encoding was used to save them. If that is not possible, try a few likely encodings and see which one decodes the text correctly (see https://docs.python.org/3/library/codecs.html#standard-encodings ).
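As a rough sketch of the "try a few encodings" approach: the helper below reads the raw bytes once and attempts each candidate in turn. The function name read_text_with_fallback and the candidate list are assumptions made for illustration, not something taken from the question. cp1252 and latin-1 are listed only because they are common for older French text, and byte 0xe2 is "â" in both. Note that latin-1 accepts every byte, so it always "succeeds"; a decode without errors does not prove the encoding is right, so check the accented characters in the result.

```python
from pathlib import Path

# Assumed candidates, tried in order; adjust for your corpus.
# latin-1 never raises, so keep it last as a catch-all.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def read_text_with_fallback(path, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding) using the first encoding that decodes cleanly."""
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
    raise ValueError(f"none of {encodings} could decode {path}")
```

Calling text, used = read_text_with_fallback(infile) inside the loop and printing used for each file shows which encoding each document ended up being decoded with.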

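For your second question: documents.append(data) should work. Your mistake is in how the list is initialized: you pre-fill documents with one empty string per file, so everything you append lands after those empty strings. Start from an empty list instead. This is all you need: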

documents = []
for infile in file_names:
    ...
    documents.append(data)

Final tip: you're opening files but never closing them. The with statement can help here, because it closes the file automatically when the block ends.
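Putting the two tips together, a minimal sketch could look like the following. The function name load_documents is hypothetical, and the default encoding="utf-8" is an assumption that should be replaced with whatever encoding the files really use.

```python
import glob
import os


def load_documents(corpus_dir, encoding="utf-8"):
    """Read every regular file in corpus_dir and return the texts as a list."""
    documents = []  # start empty; one string is appended per file
    for path in sorted(glob.glob(os.path.join(corpus_dir, "*"))):
        if not os.path.isfile(path):
            continue  # skip sub-directories matched by the wildcard
        # The with statement closes the file even if read() raises.
        with open(path, "r", encoding=encoding) as handle:
            documents.append(handle.read())
    return documents
```

With the directory from the question, this would be called as documents = load_documents("/Corpus"); each item of documents is then the full text of one file, in sorted file-name order.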
