
Python - Finding unicode/ascii problems

I am using csv.reader to pull in info from a very long sheet. I am doing work on that data set and then I am using the xlwt package to give me a workable Excel file.

However, I get this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 34: ordinal not in range(128)

My question to you all is, how can I find exactly where that error is in my data set? Also, is there some code that I can write which will look through my data set and find out where the issues lie (because some data sets run without the above error and others have problems)?

The answer is quite simple, actually: as soon as you read your data from your file, convert it to unicode using the encoding of your file, and handle the UnicodeDecodeError exception:

try:
    # decode using the encoding of your file (utf-8 here; use ascii if that is what you have)
    unicode_data = str_data.decode("utf-8")
except UnicodeDecodeError, e:
    # e tells you exactly which byte failed, e.g. "can't decode byte 0x92 in position 34"
    print "The error is there: %s" % e

This will save you from many troubles; you won't have to worry about multibyte character encodings, and external libraries (including xlwt) will just do The Right Thing if they need to write it.

Python 3.0 will make it mandatory to specify the encoding when converting between bytes and strings, so it's a good idea to do it now.
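
To answer the "where exactly" part of the question: a UnicodeDecodeError carries the offending byte offset in its e.start attribute (byte 0x92 is typically a Windows-1252 curly apostrophe, which is common in data that has passed through Excel or Word). A minimal sketch of reporting the row and column while decoding, assuming a hypothetical input file data.csv (Python 2, like the rest of this thread):

import csv

reader = csv.reader(open('data.csv', 'rb'))  # hypothetical input file
for rownum, row in enumerate(reader):
    for colnum, cell in enumerate(row):
        try:
            cell.decode('utf-8')
        except UnicodeDecodeError, e:
            # e.start is the byte offset within the cell, e.object the raw bytes
            print "Row %d, column %d, byte offset %d: %r" % (rownum, colnum, e.start, cell)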

The Python 2 csv module doesn't support unicode input or NUL characters. You might be able to work around that by replacing them with something like this, though (replace 'utf-8' with the encoding your CSV data is actually in):

import codecs
import csv

class AsciiFile:
    def __init__(self, path):
        self.f = codecs.open(path, 'rb', 'utf-8')

    def close(self):
        self.f.close()

    def __iter__(self):
        for line in self.f:
            # 'replace' for unicode characters -> ?, 'ignore' to ignore them
            y = line.encode('ascii', 'replace')
            y = y.replace('\0', '?') # Can't handle null characters!
            yield y

f = AsciiFile(PATH)
r = csv.reader(f)
...
f.close()
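
As a side note, csv.reader accepts any iterable of byte strings, so the same replace-and-strip idea can be written inline with a generator expression (a sketch, using the same PATH placeholder):

import codecs
import csv

f = codecs.open(PATH, 'rb', 'utf-8')
# Same idea as AsciiFile above, expressed as a generator expression.
reader = csv.reader(line.encode('ascii', 'replace').replace('\0', '?') for line in f)
for row in reader:
    pass  # ... work on each row ...
f.close()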

If you want to find the positions of the characters which can't be handled by the csv module, you could do e.g.:

import codecs

f = codecs.open(PATH, 'rb', 'utf-8')
for lineno, line in enumerate(f):
    for x, c in enumerate(line):
        if not c.encode('ascii', 'ignore') or c == '\0':
            print "Character ordinal %s line %s character %s is unicode or null!" % (ord(c), lineno, x)
f.close()

Alternatively, you could use this CSV opener I wrote, which can handle Unicode characters:

import codecs

def OpenCSV(Path, Encoding, Delims, StartAtRow, Qualifier, Errors):
    infile = codecs.open(Path, "rb", Encoding, errors=Errors)
    for Line in infile:
        Line = Line.strip('\r\n')
        if (StartAtRow - 1) and StartAtRow > 0: StartAtRow -= 1
        elif Qualifier != '(None)':
            # Take a note of the chars 'before' just 
            # in case of excel-style """ quoting.
            cB41 = ''; cB42 = ''
            L = ['']
            qMode = False
            for c in Line: 
                if c==Qualifier and c==cB41==cB42 and qMode:
                    # Triple qualifiers, so allow it with one
                    L[-1] = L[-1][:-2]
                    L[-1] += c
                elif c==Qualifier: 
                    # A qualifier, so reverse qual mode
                    qMode = not qMode
                elif c in Delims and not qMode: 
                    # Not in qual mode and delim
                    L.append('')
                else: 
                    # Nothing to see here, move along
                    L[-1] += c
                cB42 = cB41
                cB41 = c
            yield L
        else:
            # There aren't any qualifiers.
            cB41 = ''; cB42 = ''
            L = ['']
            for c in Line: 
                cB42 = cB41; cB41 = c
                if c in Delims: 
                    # Delim
                    L.append('')
                else: 
                    # Nothing to see here, move along
                    L[-1] += c
            yield L

for listItem in OpenCSV(PATH, Encoding='utf-8', Delims=[','], StartAtRow=0, Qualifier='"', Errors='replace'):
    ...

You can refer to code snippets such as the UnicodeReader recipe in the Examples section of the Python 2 csv module documentation to get a csv reader with unicode encoding support.
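
A sketch of that UnicodeReader recipe, reproduced here from memory (check the documentation for the canonical version):

import codecs
import csv

class UTF8Recoder:
    """Iterator that reads an encoded stream and re-encodes the input to UTF-8."""
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)
    def __iter__(self):
        return self
    def next(self):
        return self.reader.next().encode('utf-8')

class UnicodeReader:
    """CSV reader that iterates over lines of 'f', which is in the given encoding."""
    def __init__(self, f, dialect=csv.excel, encoding='utf-8', **kwds):
        self.reader = csv.reader(UTF8Recoder(f, encoding), dialect=dialect, **kwds)
    def __iter__(self):
        return self
    def next(self):
        return [unicode(s, 'utf-8') for s in self.reader.next()]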

PLEASE give the full traceback that you got along with the error message. When we know where you are getting the error (reading CSV file, "doing work on that data set", or in writing an XLS file using xlwt), then we can give a focused answer.

It is very possible that your input data is not all plain old ASCII. What produces it, and in what encoding?

To find where the problems (not necessarily errors) are, try a little script like this (untested):

import sys, glob
for pattern in sys.argv[1:]:
    for filepath in glob.glob(pattern):
        for linex, line in enumerate(open(filepath, 'r')):
            if any(c >= '\x80' for c in line):
                print "Non-ASCII in line %d of file %r" % (linex+1, filepath)
                print repr(line)

It would be useful if you showed some samples of the "bad" lines that you find, so that we can judge what the encoding might be.
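
If you also want a programmatic guess at the encoding, the third-party chardet library can help; a minimal sketch (assumes chardet is installed and data.csv is a hypothetical file name):

import chardet

raw = open('data.csv', 'rb').read()
guess = chardet.detect(raw)  # e.g. {'encoding': 'windows-1252', 'confidence': 0.73}
print "Guessed encoding: %(encoding)s (confidence %(confidence).2f)" % guess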

I'm curious about using "csv.reader to pull in info from a very long sheet" -- what kind of "sheet"? Do you mean that you are saving an XLS file as CSV, then reading the CSV file? If so, you could use xlrd to read directly from the input XLS file, getting unicode text which you can give straight to xlwt, avoiding any encode/decode problems.
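
A minimal sketch of that xlrd-to-xlwt round trip (the file names in.xls and out.xls are placeholders):

import xlrd
import xlwt

book = xlrd.open_workbook('in.xls')
sheet = book.sheet_by_index(0)

out = xlwt.Workbook(encoding='utf-8')
out_sheet = out.add_sheet('Sheet1')
for r in range(sheet.nrows):
    for c in range(sheet.ncols):
        # xlrd returns unicode for text cells, which xlwt accepts directly.
        out_sheet.write(r, c, sheet.cell_value(r, c))
out.save('out.xls')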

Have you worked through the tutorial from the python-excel.org site?
