
Problem reading a UTF-8 text file in Python

I have a problem reading and printing a UTF-8 text file in Python.

I have a file, test.txt, encoded as UTF-8 without a BOM. It contains two characters:

大声

The first character, "大", is Chinese and the second, "声", is Japanese. When I run the following code in Ulipad (a Python editor) to read the file and print these two characters:

import codecs
infile = "C:\\test.txt"

f = codecs.open(infile, "r", "utf-8")
s = f.read()

print(s)

I get this error:

"UnicodeEncodeError: 'cp950' codec can't encode character '\u58f0' in position 1:
 illegal multibyte sequence"

I found that it is caused by the second character, "声".

But when I run the same code in IDLE, Python's default GUI, it prints both characters with no error. How can I fix the problem?

My environment is Python 3.1 on Traditional Chinese Windows XP.

You get the error when printing because:

(1) Ulipad is printing to sys.stdout, which is the stdout of the legacy MS-DOS Command Prompt window.
(2) Your Traditional Chinese Windows XP uses the cp950 encoding, which is Big5 plus Microsoftian fiddling.
(3) You say your second character is Japanese, by which you probably mean that it is not also Chinese, and so it is unlikely to be a valid character in Big5+.
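Point (3) can be checked from Python itself, without involving any console. This is a minimal sketch, assuming the standard cp950 codec that ships with CPython; `can_encode` is a hypothetical helper name, not something from the original code:

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# ascii() keeps the output printable on any console, whatever its code page.
for ch in "大声":
    print(ascii(ch), can_encode(ch, "cp950"))
```

The character for which this reports False is the one that print() chokes on when stdout is a cp950 console.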

On the other hand, IDLE writes to its own window and is not bound on the MS-DOS wheel :-) ... so there's a much greater repertoire of characters that it can print.
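If the script has to keep running on a console with a limited repertoire, one possible workaround is to degrade unsupported characters to escapes instead of letting print() raise. This is only a sketch: `printable` is a hypothetical helper, and it assumes that falling back to backslash escapes is acceptable output:

```python
import sys

def printable(text, encoding=None):
    """Return text with characters the encoding lacks replaced by backslash escapes."""
    # Default to whatever encoding stdout reports; fall back to ASCII if it reports none.
    encoding = encoding or sys.stdout.encoding or "ascii"
    # Round-trip through the target encoding so that only encodable characters remain.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)

print(printable("大声"))
```

On a console that supports both characters the text comes through unchanged; on a cp950 console the second character would appear as an escape rather than stopping the script.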

声 may be Japanese, but it is also the Simplified Chinese for "sound" (traditional 聲). cp950 is Traditional Chinese and doesn't support that simplified character.

Since you are using a Chinese version of Windows, you may be able to change your default code page to cp936 (Simplified Chinese, GBK) and see the output.
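Whether a code page covers both characters can be tested before touching any console settings. A small sketch, assuming the codecs bundled with CPython; the `support` dictionary is just an illustrative name:

```python
text = "大声"
support = {}

# Try each candidate code page and record whether it can represent the whole string.
for codec in ("cp950", "cp936", "utf-8"):
    try:
        text.encode(codec)
        support[codec] = True
    except UnicodeEncodeError:
        support[codec] = False

print(support)
```

This only tells you what the codec can represent; the console still needs a font that contains the glyphs.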

I'm unfamiliar with Ulipad, but try running in a Windows console:

chcp 936

and then running your script. If that doesn't work, you can change the default language for non-Unicode programs through Control Panel, Regional and Language Options, Advanced tab. This is how I was able to print Chinese in a console on my US English-based Windows.

Update

Reading about Ulipad, it says:

"Multilanguage support: Currently supports 4 languages: English, Spanish, Simplified Chinese and Traditional Chinese, which can be auto-detected."

Perhaps you can override the auto-detected Traditional Chinese to Simplified Chinese, which may select a code page and/or font that supports that particular character. Since it doesn't support Japanese, there will probably still be characters you can't display properly.


 