Python: finding unicode / ascii problems

Question

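I am using csv.reader to pull information out of a very long sheet. I do some processing on that data set, and then use the xlwt package to give me a workable Excel file.

However, I get this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 34: ordinal not in range(128)

My question to you all is: how can I find exactly where that error is in my data set? Also, is there some code I could write that would look through my data set and find where the problems lie (some data sets run without the above error, while others have problems)?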

Answer 1

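The answer is actually quite simple: as soon as you read your data from the file, convert it to unicode using the file's encoding, and handle the UnicodeDecodeError exception: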

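try:
    # Decode with the file's actual encoding (utf-8 here; use ascii if you want).
    unicode_data = str_data.decode("utf-8")
except UnicodeDecodeError as e:
    # e.start is the offset of the first byte that could not be decoded.
    print "The error is at byte offset %d: %r" % (e.start, str_data[e.start])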

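This will save you a lot of trouble: you won't have to worry about multibyte character encodings, and external libraries (including xlwt) will just do The Right Thing if they need to write the data out.

Python 3.0 will make it mandatory to specify an encoding when converting between bytes and text, so it is a good idea to start doing so now.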
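As a rough illustration of how the exception itself can tell you where the bad byte is, here is a minimal Python 3 sketch. The helper name `locate_decode_error` and the sample bytes are made up for this example; the `start`, `end` and `reason` attributes are standard on `UnicodeDecodeError`.

```python
def locate_decode_error(raw, encoding="ascii"):
    """Return None if raw decodes cleanly, else details about the first bad byte."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return {
            "offset": exc.start,            # index of the first undecodable byte
            "byte": raw[exc.start],         # its integer value
            "reason": exc.reason,           # the codec's explanation
            # a few bytes on either side, to help spot the record in the file
            "context": raw[max(0, exc.start - 10):exc.end + 10],
        }
    return None


# Hypothetical sample containing the 0x92 byte from the question.
sample = b"It" + b"\x92" + b"s a test"
info = locate_decode_error(sample)
print(info)
```

Calling this on each raw line (or field) as it is read should narrow the problem down to a specific record and offset.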

Answer 2

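The csv module does not support unicode or null characters. You may be able to replace them, though, by doing something like this (replace utf-8 with the encoding your CSV data is actually in):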

import codecs
import csv

class AsciiFile:
    def __init__(self, path):
        self.f = codecs.open(path, 'rb', 'utf-8')

    def close(self):
        self.f.close()

    def __iter__(self):
        for line in self.f:
            # 'replace' for unicode characters -> ?, 'ignore' to ignore them
            y = line.encode('ascii', 'replace')
            y = y.replace('\0', '?') # Can't handle null characters!
            yield y

f = AsciiFile(PATH)
r = csv.reader(f)
...
f.close()
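The wrapper above targets Python 2. For comparison, a minimal sketch of the same idea in Python 3, where csv.reader works on text and the decoding is done by the file object, might look like this. The function name `read_csv_rows` and the in-memory sample are assumptions for illustration only.

```python
import csv
import io


def read_csv_rows(binary_stream, encoding="utf-8"):
    """Decode a binary stream with an explicit encoding and parse it as CSV."""
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising;
    # newline="" is what the csv documentation recommends for input files.
    text = io.TextIOWrapper(binary_stream, encoding=encoding,
                            errors="replace", newline="")
    return list(csv.reader(text))


# Hypothetical sample data standing in for a real file opened with open(path, "rb").
raw = "name,note\nAna,it\u2019s fine\n".encode("utf-8")
rows = read_csv_rows(io.BytesIO(raw))
print(rows)
```

With a real file, `open(path, encoding="utf-8", errors="replace", newline="")` passed straight to `csv.reader` should behave the same way.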

If you want to find the positions of the characters that the csv module cannot handle, you could do something like this:

import codecs

lineno = 0
f = codecs.open(PATH, 'rb', 'utf-8')
for line in f:
    for x, c in enumerate(line):
        if not c.encode('ascii', 'ignore') or c == '\0':
            print "Character ordinal %s line %s character %s is unicode or null!" % (ord(c), lineno, x)
    lineno += 1
f.close()
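A Python 3 take on the same scan, written as a generator so the hits can be collected or filtered, could be sketched as follows. The name `find_problem_chars` is hypothetical, and unlike the loop above it reports 1-based line numbers.

```python
import io


def find_problem_chars(lines):
    """Yield (line_number, column, code_point) for each non-ASCII or NUL character."""
    for lineno, line in enumerate(lines, start=1):
        for col, ch in enumerate(line):
            if ord(ch) > 127 or ch == "\0":
                yield lineno, col, ord(ch)


# In-memory stand-in for open(PATH, encoding="utf-8", newline="").
text = io.StringIO("a,b\nc\u2019d,e\nf,\0g\n")
hits = list(find_problem_chars(text))
for lineno, col, code in hits:
    print("line %d, column %d: U+%04X" % (lineno, col, code))
```

Columns here are 0-based character positions within the decoded line, not byte offsets.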

Alternatively, you could use this CSV opener that I wrote, which can handle Unicode characters:

import codecs

def OpenCSV(Path, Encoding, Delims, StartAtRow, Qualifier, Errors):
    infile = codecs.open(Path, "rb", Encoding, errors=Errors)
    for Line in infile:
        Line = Line.strip('\r\n')
        if (StartAtRow - 1) and StartAtRow > 0: StartAtRow -= 1
        elif Qualifier != '(None)':
            # Take a note of the chars 'before' just 
            # in case of excel-style """ quoting.
            cB41 = ''; cB42 = ''
            L = ['']
            qMode = False
            for c in Line: 
                if c==Qualifier and c==cB41==cB42 and qMode:
                    # Triple qualifiers, so allow it with one
                    L[-1] = L[-1][:-2]
                    L[-1] += c
                elif c==Qualifier: 
                    # A qualifier, so reverse qual mode
                    qMode = not qMode
                elif c in Delims and not qMode: 
                    # Not in qual mode and delim
                    L.append('')
                else: 
                    # Nothing to see here, move along
                    L[-1] += c
                cB42 = cB41
                cB41 = c
            yield L
        else:
            # There aren't any qualifiers.
            cB41 = ''; cB42 = ''
            L = ['']
            for c in Line: 
                cB42 = cB41; cB41 = c
                if c in Delims: 
                    # Delim
                    L.append('')
                else: 
                    # Nothing to see here, move along
                    L[-1] += c
            yield L

for listItem in OpenCSV(PATH, Encoding='utf-8', Delims=[','], StartAtRow=0, Qualifier='"', Errors='replace'):
    ...

Answer 3

You can refer to the code snippets in the following question to get a csv reader with unicode support:

General Unicode/UTF-8 support for csv files in Python 2.6

Answer 4

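Please give the full traceback that you got along with the error message. Once we know where the error occurs (reading the CSV file, "doing some processing on the data set", or writing the XLS file with xlwt), we can give a focused answer.

It is very likely that your input data is not all plain old ASCII. What produces it, and in what encoding?

To find where the problems (not necessarily errors) are, try a little script like this (untested):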

import sys, glob
for pattern in sys.argv[1:]:
    for filepath in glob.glob(pattern):
        for linex, line in enumerate(open(filepath, 'r')):
            if any(c >= '\x80' for c in line):
                print "Non-ASCII in line %d of file %r" % (linex+1, filepath)
                print repr(line)

It would be useful if you showed some samples of the "bad" lines that you find, so that we can judge what the encoding might be.
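One way to start judging the encoding yourself is to decode a bad line with a few candidate codecs and compare the results. The following Python 3 sketch assumes a hypothetical helper `try_encodings` and an invented sample line; the candidate list is only a guess at likely encodings.

```python
def try_encodings(raw_line, candidates=("ascii", "utf-8", "cp1252", "latin-1")):
    """Map each candidate encoding to the decoded text or a failure note."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw_line.decode(name)
        except UnicodeDecodeError as exc:
            results[name] = "failed at byte offset %d" % exc.start
    return results


# Hypothetical bad line containing the 0x92 byte from the question.
bad = b"Don\x92t panic"
outcomes = try_encodings(bad)
for name, outcome in outcomes.items():
    print(name, repr(outcome))
```

In cp1252, byte 0x92 is the right single quotation mark (U+2019), which would fit data exported from a Windows application. Note that latin-1 never fails to decode, so a "success" there proves nothing on its own; look at whether the decoded text makes sense.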

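I'm curious about your use of "csv.reader to pull information out of a very long sheet": what kind of "sheet"? Do you mean that you save an XLS file as CSV and then read the CSV file? If so, you could use xlrd to read directly from the input XLS file, getting unicode text that you can pass straight to xlwt, which avoids any encode/decode problems.

Have you worked through the tutorial on the python-excel.org site?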


4 answers

Answer 1
3 accepted 2010-05-02 10:26:14

Answer 2
1 2010-05-02 10:13:33

Answer 3
0 2010-05-02 10:48:12

Answer 4
0 2010-05-02 11:43:08
