Reading/Encoding Chinese characters from CSV files in Python

I'm trying to read a CSV file that contains information in simplified Chinese, and encode it into a request to put into the database.

Section of my code:

#coding:utf-8    
import csv, sys, urllib, urllib2

with open('testdata1.csv', 'rU') as f:
    reader = csv.reader(f)
    try:
        z = csv.reader(f, delimiter='\t')
        for row in reader:
            print row[0]
            if row[0] in (None, ""): 
                continue
            elif row[0] == '家长姓': 
                print row[0]

However I'm encountering two problems:

1) Sublime Text does not seem to understand Chinese characters, i.e. the comparison elif row[0] == '家长姓' never matches.

2) Sublime Text doesn't seem to be able to print Chinese characters (when I tell it to print some of the information, all Chinese characters are replaced by underscores).

I've already tried File>Save with Encoding>UTF-8 to no avail. Any help would be appreciated.

Try opening the file with codecs, using the appropriate encoding:

>>> import codecs
>>> f = codecs.open("testdata1.csv", "r", "utf-8") 
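One caveat, as far as I know: the Python 2 csv module does not accept unicode input, so handing a decoded file object straight to csv.reader can fail on non-ASCII rows. Below is a minimal sketch of decoding at the file level, assuming Python 3 (where csv works on text). The sample rows and the helper name first_column are made up for illustration and are not from the question:

```python
# -*- coding: utf-8 -*-
# Sketch for Python 3; the sample data and first_column() are hypothetical.
import csv
import io
import os
import tempfile


def first_column(path, encoding="utf-8"):
    """Return the first cell of each non-empty row, decoded as text."""
    # newline="" lets the csv module handle line endings itself.
    with io.open(path, "r", encoding=encoding, newline="") as f:
        return [row[0] for row in csv.reader(f) if row and row[0]]


# Write a small throwaway UTF-8 file so the example is self-contained.
tmp = tempfile.NamedTemporaryFile(suffix=".csv", delete=False)
tmp.write(u"家长姓,家长名\n王,小明\n".encode("utf-8"))
tmp.close()

cells = first_column(tmp.name)
os.remove(tmp.name)

# The header is found because both sides are now text, not raw bytes.
found_header = u"家长姓" in cells
```

If the file was saved in another encoding, the encoding argument would need to match it.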

Non-ASCII characters are often hard to work with because three separate things have to be right:

  • the system and the editor must be able to display them
  • the encoding of the source file must be declared ( # -*- coding: ... -*- on the first or second line)
  • both of the above are independent of the encoding of the output stream ( sys.stdout.encoding ), which is used when printing them
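To see which encodings are in play on a given machine, something like the following sketch may help. It uses only standard library calls; the function name encoding_report is just an illustrative choice. If the stdout entry names an encoding that cannot represent Chinese characters, that could explain substitute characters in the console:

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python currently uses for text I/O."""
    return {
        # Encoding used when printing; may be None if output is piped.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Python's default string encoding (ascii on Python 2, utf-8 on Python 3).
        "default": sys.getdefaultencoding(),
        # Encoding preferred by the operating system locale.
        "locale": locale.getpreferredencoding(),
    }


report = encoding_report()
for name in sorted(report):
    print("%s: %s" % (name, report[name]))
```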

First, your coding line omits the -*- markers. Python itself accepts #coding:utf-8, but some editors may fail to recognize the declaration without them.

You could also check whether the IDLE editor handles the Chinese characters more easily.

In any case, if everything else fails, you can always use explicit Unicode escapes:

>>> txt = u'家长姓'  # only works if the source encoding is correctly declared for both editor and interpreter
>>> txt2 = u'\u5bb6\u957f\u59d3'  # pure ASCII escapes, works on any system
>>> txt == txt2
True
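To make the difference between code-point escapes and UTF-8 bytes concrete, here is a small sketch. It should run on both Python 2.7 and Python 3, and the variable names are only illustrative:

```python
# -*- coding: utf-8 -*-
text = u"\u5bb6\u957f\u59d3"  # code points of the three characters

# Python can generate the escapes so they need not be typed by hand.
escaped = text.encode("unicode_escape").decode("ascii")

# The UTF-8 form is a byte sequence, not a run of code points.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")
```

Here escaped holds the \u escapes as plain ASCII text, while utf8_bytes holds the nine bytes \xe5\xae\xb6\xe9\x95\xbf\xe5\xa7\x93, which only turn back into the original text once decoded.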

TL;DR: if you have trouble using non-ASCII characters in Python source, use their escape codes.

'家长姓' in your code is a <type 'str'>, and the content you read from the file is also a <type 'str'>, but their encodings may not be the same. You can decode both to <type 'unicode'> before comparing them.

For example:

row[0].decode('utf-8') == u'家长姓'
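To illustrate why a byte-level comparison can fail, here is a sketch that assumes, purely for illustration, that the file was saved as GBK while the value in the source code is UTF-8. The thread does not say which encoding the CSV actually uses:

```python
# -*- coding: utf-8 -*-
header = u"家长姓"

as_utf8 = header.encode("utf-8")  # 9 bytes, 3 per character
as_gbk = header.encode("gbk")     # 6 bytes, 2 per character

# Same text, different bytes: a raw comparison reports a mismatch.
bytes_match = as_utf8 == as_gbk

# Decoding each side with its own encoding recovers equal text.
text_match = as_utf8.decode("utf-8") == as_gbk.decode("gbk")
```

In this sketch bytes_match is False and text_match is True, which is why decoding before the comparison matters.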

Here is a small test of str versus unicode:

test = '你好'
test1 = u'你好'
print type(test)
print type(test1)
print test == test1
print type(test.decode('utf-8'))
print test.decode('utf-8') == test1

output:

<type 'str'>
<type 'unicode'>
False
<type 'unicode'>
True
