UnicodeDecodeError when reading a text file

I am a beginner to Python (I am using 3.4). This is the relevant part of my code.

fileObject = open("countable nouns raw.txt", "rt")
bigString = fileObject.read()
fileObject.close()

Whenever I try to read this file I get:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 82273: character maps to <undefined>

I have been reading around and it seems to be something to do with my default encoding not matching the text file encoding. I've read in another post that you can use this method to read a file with a specific encoding:

import codecs
f = codecs.open("file.txt", "r", "utf-8")

But you have to know it in advance. The thing is I don't know how the text file is encoded. A few posts suggested using Chardet. I've installed it but I have no idea how to get it to read a text file.

Any ideas on how to get around this??

There is no need to use codecs.open(); that's advice for Python 2.

In Python 3 open() takes an encoding argument:

fileObject = open("countable nouns raw.txt", "rt", encoding='utf8')

This does require that you know what codec was used for the file, of course. Generally speaking, there is no easy way for Python to figure that out; individual file formats may include codec information or have standardised on a given codec, but if all you have is a generic text file, you'll have to figure out what created it and what codec that program used to write the data.
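
When the codec is unknown, one pragmatic workaround is to try a short list of likely candidates and keep the first one that decodes without error. The sketch below is only an illustration: the helper name read_text_with_fallback and the candidate list are assumptions, not part of the answer above, and a clean decode does not prove the guess is right. It also reproduces the error from the question: byte 0x9d is undefined in cp1252 (a common default on Windows, reported as the 'charmap' codec), yet it occurs in the UTF-8 encoding of a right curly quote.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Hypothetical helper: return (text, encoding) for the first codec that decodes cleanly."""
    with open(path, "rb") as f:  # read raw bytes so no implicit decoding happens
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the data, try the next one
    raise ValueError("none of the candidate encodings could decode %r" % path)


# Demo data (made up for illustration): UTF-8 text containing curly quotes.
sample = "She said \u201chello\u201d\n"
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))
    demo_path = tmp.name

# Reading the UTF-8 file as cp1252 fails on byte 0x9d, as in the question.
cp1252_error = None
try:
    with open(demo_path, "rt", encoding="cp1252") as f:
        f.read()
except UnicodeDecodeError as exc:
    cp1252_error = str(exc)

text, used = read_text_with_fallback(demo_path)
os.remove(demo_path)
print(cp1252_error)
print(used)
```

latin-1 is deliberately left out of the candidates: it maps every possible byte to some character, so it never fails and would hide a wrong guess.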

In addition to using the correct Python method to specify the encoding when calling open, you could try to determine the encoding with the file tool.

A file foo.txt containing

ÙÚÛÜ

can be checked using

$ file foo.txt 
foo.txt: UTF-8 Unicode text
$ wc foo.txt
1 1 9 foo.txt

As wc shows, the file contains nine bytes: two for each of the four characters, plus one for the newline.
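
The same byte count can be checked from Python. This is a quick sketch, with the four letters written as escapes so the code points are unambiguous:

```python
# The four characters from foo.txt (U+00D9..U+00DC) followed by a newline.
content = "\u00d9\u00da\u00db\u00dc\n"

utf8_bytes = content.encode("utf-8")      # two bytes per accented letter
latin1_bytes = content.encode("latin-1")  # one byte per character, for comparison

print(len(content))       # number of characters
print(len(utf8_bytes))    # number of bytes wc would report for a UTF-8 file
print(len(latin1_bytes))  # number of bytes if the file were Latin-1 instead
```

A differing byte count for the same text is one practical hint about which encoding a file uses.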

To add to Martijn Pieters' answer: if you are a Mac user and have trouble figuring out what encoding a particular file is in, you may want to check out this link: http://osxdaily.com/2015/08/11/determine-file-type-encoding-command-line-mac-os-x/

One way you can detect the encoding on any operating system is by using the library chardet. If you don't have it, run pip install chardet. After that, it is fairly simple (this example runs the detection on a web page fetched with requests):

import chardet
import requests
content = requests.get("http://yahoo.co.jp/").content
detect = chardet.detect(content)
print(detect)

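chardet only guesses the encoding; the result (which also carries a confidence value) is likely but not guaranteed to be correct. The same call works on the raw bytes of a local file read in binary mode, after which you can open the file with the detected encoding: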

open('file.txt', encoding=detect['encoding'])
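
For a local text file, as in the question, the detection has to run on that file's own bytes. A minimal sketch, assuming a hypothetical helper named read_with_detected_encoding and a made-up demo file; the fallback to a default codec when chardet is not installed or returns no guess is also an assumption of this sketch:

```python
import os
import tempfile

try:
    import chardet  # third-party; install with: pip install chardet
except ImportError:
    chardet = None  # assumption: use the default codec when chardet is missing


def read_with_detected_encoding(path, default="utf-8"):
    """Hypothetical helper: decode a file using chardet's best guess."""
    with open(path, "rb") as f:  # binary mode: chardet needs raw bytes
        raw = f.read()
    guess = None
    if chardet is not None:
        guess = chardet.detect(raw)["encoding"]  # can be None if nothing matched
    encoding = guess or default
    return raw.decode(encoding), encoding


# Demo with a throwaway file; replace demo_file with your own path.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"plain ascii text\n")
    demo_file = tmp.name

decoded, detected = read_with_detected_encoding(demo_file)
os.remove(demo_file)
print(detected, repr(decoded))
```

Because the result is only a guess, it is worth spot-checking the decoded text, especially for short files where there is little data to analyse.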
