Python - Finding unicode/ascii problems

I am using csv.reader to pull in info from a very long sheet. I am doing work on that data set, and then I am using the xlwt package to give me a workable Excel file.

However, I get this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 34: ordinal not in range(128)

My question to you all is: how can I find exactly where that error is in my data set? Also, is there some code I can write that will look through my data set and find out where the issues lie (some data sets run without the above error and others have problems)?

The answer is quite simple, actually: as soon as you read your data from your file, convert it to unicode using the encoding of your file, and handle the UnicodeDecodeError exception:

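try:
    # decode using utf-8 (use ascii if you want)
    unicode_data = str_data.decode("utf-8")
except UnicodeDecodeError as e:
    # e.start is the offset of the first byte that could not be decoded
    print("The error is at byte offset %d: %s" % (e.start, e))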

This will save you a lot of trouble: you won't have to worry about multibyte character encodings, and external libraries (including xlwt) will just do The Right Thing if they need to write the data.

Python 3 separates text (str) from binary data (bytes) and requires an explicit encoding whenever you convert between them, so it's a good idea to get into the habit now.
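To turn the exception into an actual location in the data, the attributes of UnicodeDecodeError can be used. The following is a minimal sketch for Python 3, not part of the original answer; the helper name locate_decode_error and the sample bytes are made up for illustration.

```python
def locate_decode_error(raw, encoding="ascii"):
    """Return None if raw decodes cleanly, else (line, column, byte_value)."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the offset of the first undecodable byte in raw.
        line = raw.count(b"\n", 0, err.start) + 1
        line_start = raw.rfind(b"\n", 0, err.start) + 1
        column = err.start - line_start
        return line, column, raw[err.start]
    return None


# Hypothetical sample: the second line contains byte 0x92.
sample = b"name,comment\nBob,It\x92s fine\n"
print(locate_decode_error(sample))            # ascii cannot decode 0x92
print(locate_decode_error(sample, "cp1252"))  # cp1252 can decode 0x92
```

Line numbers here are 1-based and columns are 0-based byte offsets within the line. Byte 0x92 is a curly apostrophe in Windows-1252, so data exported from Windows tools is a plausible (but unconfirmed) source of the error in the question.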

The Python 2 csv module doesn't support unicode input or null characters. You might be able to replace them by doing something like this, though (replace 'utf-8' with the encoding of your CSV data):

import codecs
import csv

class AsciiFile:
    def __init__(self, path):
        self.f = codecs.open(path, 'rb', 'utf-8')

    def close(self):
        self.f.close()

    def __iter__(self):
        for line in self.f:
            # 'replace' for unicode characters -> ?, 'ignore' to ignore them
            y = line.encode('ascii', 'replace')
            y = y.replace('\0', '?') # Can't handle null characters!
            yield y

f = AsciiFile(PATH)
r = csv.reader(f)
...
f.close()

If you want to find the positions of the characters which can't be handled by the csv module, you could do, e.g.:

import codecs

lineno = 0
f = codecs.open(PATH, 'rb', 'utf-8')
for line in f:
    for x, c in enumerate(line):
        if not c.encode('ascii', 'ignore') or c == '\0':
            print "Character ordinal %s line %s character %s is unicode or null!" % (ord(c), lineno, x)
    lineno += 1
f.close()
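For readers on Python 3, the same scan can be sketched as a generator over already-decoded lines. This is an illustrative rewrite rather than the answerer's code; find_suspect_chars is a made-up name, and the in-memory sample stands in for a real file.

```python
import io


def find_suspect_chars(lines):
    """Yield (line_number, column, code_point) for non-ASCII or NUL characters."""
    for line_number, line in enumerate(lines, start=1):
        for column, char in enumerate(line):
            if ord(char) > 127 or char == "\0":
                yield line_number, column, ord(char)


# Hypothetical sample data; a real file would be opened with an explicit encoding.
text = io.StringIO("id,note\n1,caf\u00e9\n2,plain\n3,bad\0byte\n")
for hit in find_suspect_chars(text):
    print("line %d, column %d: code point %d" % hit)
```

With a real file, the argument could be something like open(PATH, encoding="utf-8", newline=""), assuming the file really is UTF-8. Line numbers are 1-based here, unlike the 0-based counter above.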

Alternatively, you could use this CSV opener I wrote, which can handle Unicode characters:

import codecs

def OpenCSV(Path, Encoding, Delims, StartAtRow, Qualifier, Errors):
    infile = codecs.open(Path, "rb", Encoding, errors=Errors)
    for Line in infile:
        Line = Line.strip('\r\n')
        if (StartAtRow - 1) and StartAtRow > 0: StartAtRow -= 1
        elif Qualifier != '(None)':
            # Take a note of the chars 'before' just 
            # in case of excel-style """ quoting.
            cB41 = ''; cB42 = ''
            L = ['']
            qMode = False
            for c in Line: 
                if c==Qualifier and c==cB41==cB42 and qMode:
                    # Triple qualifiers, so allow it with one
                    L[-1] = L[-1][:-2]
                    L[-1] += c
                elif c==Qualifier: 
                    # A qualifier, so reverse qual mode
                    qMode = not qMode
                elif c in Delims and not qMode: 
                    # Not in qual mode and delim
                    L.append('')
                else: 
                    # Nothing to see here, move along
                    L[-1] += c
                cB42 = cB41
                cB41 = c
            yield L
        else:
            # There aren't any qualifiers.
            cB41 = ''; cB42 = ''
            L = ['']
            for c in Line: 
                cB42 = cB41; cB41 = c
                if c in Delims: 
                    # Delim
                    L.append('')
                else: 
                    # Nothing to see here, move along
                    L[-1] += c
            yield L

for listItem in OpenCSV(PATH, Encoding='utf-8', Delims=[','], StartAtRow=0, Qualifier='"', Errors='replace'):
    ...
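In Python 3 the standard csv module works on text, so a hand-written parser is usually not needed there. A minimal sketch, assuming Python 3 and UTF-8 input; read_csv_rows and the sample bytes are illustrative, not part of the answer above.

```python
import csv
import io


def read_csv_rows(stream, encoding="utf-8", errors="replace"):
    """Decode a binary stream and yield rows parsed by the csv module."""
    # newline="" leaves line-ending handling to the csv module,
    # which matters for line breaks inside quoted fields.
    text = io.TextIOWrapper(stream, encoding=encoding, errors=errors, newline="")
    for row in csv.reader(text):
        yield row


# Hypothetical sample with a non-ASCII name and Excel-style doubled quotes.
raw = io.BytesIO('name,quote\r\nZo\u00eb,"She said ""hi"", then left"\r\n'.encode("utf-8"))
for row in read_csv_rows(raw):
    print(row)
```

For a file on disk, open(PATH, "rb") could be passed as the stream. With errors="replace", undecodable bytes become U+FFFD instead of raising, which keeps the read going but hides where the bad bytes were.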

You can refer to the code snippets in the question below to get a csv reader with unicode encoding support:

PLEASE give the full traceback that you got along with the error message. When we know where you are getting the error (reading the CSV file, "doing work on that data set", or writing an XLS file using xlwt), then we can give a focused answer.

It is very possible that your input data is not all plain old ASCII. What produces it, and in what encoding?

To find where the problems (not necessarily errors) are, try a little script like this (untested):

import sys, glob
for pattern in sys.argv[1:]:
    for filepath in glob.glob(pattern):
        for linex, line in enumerate(open(filepath, 'r')):
            if any(c >= '\x80' for c in line):
                print "Non-ASCII in line %d of file %r" % (linex+1, filepath)
                print repr(line)

It would be useful if you showed some samples of the "bad" lines that you find, so that we can judge what the encoding might be.
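One way to judge the encoding of a "bad" line is to decode it with a few candidates and look at the results. This is a rough Python 3 sketch; try_encodings and the candidate list are assumptions chosen for illustration, and the sample line is invented.

```python
def try_encodings(raw_line, candidates=("ascii", "utf-8", "cp1252", "latin-1")):
    """Map each candidate encoding to the decoded text, or None on failure."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw_line.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Hypothetical bad line containing the byte from the error message.
for name, text in try_encodings(b"It\x92s here").items():
    print("%-8s %r" % (name, text))
```

A successful decode is not proof of the right encoding: latin-1 accepts every byte, so it never fails. The output still has to be checked by eye for text that reads sensibly.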

I'm curious about using "csv.reader to pull in info from a very long sheet" -- what kind of "sheet"? Do you mean that you are saving an XLS file as CSV, then reading the CSV file? If so, you could use xlrd to read directly from the input XLS file, getting unicode text which you can give straight to xlwt, avoiding any encode/decode problems.

Have you worked through the tutorial from the python-excel.org site?
