Converting non-ASCII characters from DictReader to ASCII

Question

There are many questions on Python and unicode/str handling. However, none of the answers work for me.

First, a file is opened using DictReader, then each row is put into an array. Then the dict value is sent to be converted to unicode.

Step One is getting the data:

f = csv.DictReader(open(filename, "r"))
data = []
for row in f:
    data.append(row)

Step Two is getting a string value from the dict and replacing the accents (found this in other posts):

s = data[i].get('Name')
strip_accents(s)

def strip_accents(s):
    try: s = unicode(s)
    except: s = s.encode('utf-8')
    s = unicodedata.normalize('NFKD', s).encode('ascii','ignore')
    return s

I use the try and except because some strings have accents and others don't. What I cannot figure out is this: unicode(s) works with a str that has no accents, but when a str has accents, it fails:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 11: ordinal not in range(128)
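The failure comes from the implicit ASCII decoding that unicode(s) applies to a byte string in Python 2. Below is a minimal sketch of the same behaviour, written for Python 3, where byte strings are explicit. The sample bytes and the try_decode helper are made up for illustration, and cp1252 is only an assumption about how the file was encoded.

```python
# Hypothetical sample: "Müller" encoded as cp1252, so "ü" is the single byte 0xfc.
raw = b"M\xfcller"


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# ASCII only covers bytes 0-127, so 0xfc cannot be decoded.
ascii_result = try_decode(raw, "ascii")

# cp1252 maps 0xfc to "ü", so decoding succeeds.
cp1252_result = try_decode(raw, "cp1252")

print(ascii_result)
print(cp1252_result)
```

Strings without accents contain only bytes below 128, which is why the ASCII decode succeeds for them and fails for the others.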

I have seen posts on this but the answers do not work. When I use type(s), it says it is <type 'str'>. So I tried to read the file as unicode:

f = csv.DictReader(codecs.open(filename,"r",encoding='utf-8'))

But as soon as it goes to read:

data = []
for row in f:
    data.append(row)

This error occurs:

  File "F:...files.py", line 9, in files
    for row in f:
  File "C:\Python27\lib\csv.py", line 104, in next
    row = self.reader.next()
  File "C:\Python27\lib\codecs.py", line 684, in next
    return self.reader.next()
  File "C:\Python27\lib\codecs.py", line 615, in next
    line = self.readline()
  File "C:\Python27\lib\codecs.py", line 530, in readline
    data = self.read(readsize, firstline=True)
  File "C:\Python27\lib\codecs.py", line 477, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 0: invalid start byte
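The traceback says the byte 0xfc cannot start a UTF-8 sequence, which suggests the file is not UTF-8 at all. One way to probe this is to try a short list of candidate encodings in order. The sketch below is for Python 3; the decode_with_fallback helper and the candidate list are assumptions for illustration, not something taken from the question.

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in order and return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# 0xfc on its own is invalid UTF-8, so the helper falls through to cp1252.
legacy_text, legacy_encoding = decode_with_fallback(b"\xfcber")

# The same word stored as real UTF-8 (0xc3 0xbc) is decoded by the first candidate.
utf8_text, utf8_encoding = decode_with_fallback(b"\xc3\xbcber")

print(legacy_text, legacy_encoding)
print(utf8_text, utf8_encoding)
```

UTF-8 goes first because it rejects most non-UTF-8 input, whereas a single-byte encoding such as cp1252 accepts almost any byte sequence and would mask the difference.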

Is this error caused by the way DictReader handles unicode? How can I get around this?

More tests. As @univerio pointed out, one item which is causing the failures is encoded as ISO-8859-1.

Modifying the open statement to:

f = csv.DictReader(codecs.open(filename,"r",encoding="cp1252"))

produces a slightly different error:

  File "F:...files.py", line 9, in files
    for row in f:
  File "C:\Python27\lib\csv.py", line 104, in next
    row = self.reader.next()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 11: ordinal not in range(128)

Using the basic open statement and modifying strip_accents() like this:

try: s = unicode(s)
except: s = s.decode("iso-8859-1").encode('utf8')
print type(s)
s = unicodedata.normalize('NFKD', s).encode('ascii','ignore')
return str(s)

prints that the type is still str and errors on:

s = unicodedata.normalize('NFKD', s).encode('ascii','ignore')
TypeError: must be unicode, not str
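The TypeError follows from the round trip: decoding produces text, but the chained .encode('utf8') turns it straight back into bytes, and unicodedata.normalize only accepts text. Here is a small Python 3 sketch of the same sequence; the sample bytes are hypothetical.

```python
import unicodedata

# Hypothetical ISO-8859-1 bytes for "Müller".
raw = b"M\xfcller"

text = raw.decode("iso-8859-1")   # text: unicode in Python 2, str in Python 3
round_trip = text.encode("utf8")  # bytes again, now UTF-8 encoded

# normalize() rejects byte strings, which is the TypeError from the question.
try:
    unicodedata.normalize("NFKD", round_trip)
    normalize_accepts_bytes = True
except TypeError:
    normalize_accepts_bytes = False

# Normalizing the decoded text works: NFKD splits "ü" into "u" plus a combining
# diaeresis, and encoding to ASCII with "ignore" drops the combining mark.
ascii_bytes = unicodedata.normalize("NFKD", text).encode("ascii", "ignore")

print(normalize_accepts_bytes)
print(ascii_bytes)
```

So the .encode('utf8') step has to be dropped, keeping only the decode before normalizing.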

Based on "Python: Converting from ISO-8859-1/latin1 to UTF-8", modifying it to:

s = unicode(s.decode("iso-8859-1").encode('utf8'))

produces a different error:

except: s = unicode(s.decode("iso-8859-1").encode('utf8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 11: ordinal not in range(128)

Answer 1

I think this should work:

import unicodedata

def strip_accents(s):
    s = s.decode("cp1252")  # decode from cp1252 encoding instead of the implicit ascii encoding used by unicode()
    s = unicodedata.normalize('NFKD', s).encode('ascii','ignore')
    return s

Opening the file with the correct encoding didn't work because DictReader doesn't seem to handle unicode strings correctly: the Python 2 csv module does not support Unicode input, so it has to be given byte strings, which are then decoded afterwards.
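For comparison, in Python 3 the csv module works on text, so the decoding can happen when the file is opened and strip_accents only needs to normalize. The following is a sketch under assumptions, not the answerer's code: the read_rows helper and the in-memory sample are made up, cp1252 is assumed as the file encoding, and every row is assumed to have exactly the header's columns.

```python
import csv
import io
import unicodedata


def strip_accents(text):
    """Decompose accented characters and drop whatever is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def read_rows(stream):
    """Read a CSV text stream and return its rows as dicts of ASCII-only values."""
    return [
        {key: strip_accents(value) for key, value in row.items()}
        for row in csv.DictReader(stream)
    ]


# Hypothetical stand-in for open(filename, encoding="cp1252", newline="").
sample = io.TextIOWrapper(
    io.BytesIO(b"Name,City\r\nM\xfcller,Z\xfcrich\r\n"),
    encoding="cp1252",
    newline="",
)

rows = read_rows(sample)
print(rows)
```

Note that the "ignore" error handler silently drops characters that have no ASCII decomposition, so this is lossy by design.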

Answer 2

Reference: "UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)". Following @Duncan's answer there:

print repr(ch)

Example:

string = 'Ka\u011f KO\u011e52 \u0131 \u0130\u00f6\u00d6 David \u00fc K\u00dc\u015f\u015e \u00e7 \u00c7'

print (repr(string))

It prints:

'Kağ KOĞ52 ı İöÖ David ü KÜşŞ ç Ç'
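The output above matches Python 3, where repr() leaves printable non-ASCII characters as they are. If the goal is to see exactly which code points a string contains, the built-in ascii() escapes everything outside ASCII. A brief sketch with a shortened, made-up sample string:

```python
# Shortened, hypothetical sample using two of the characters from the example.
name = "Ka\u011f David \u00fc"

# repr() keeps printable non-ASCII characters as they are (Python 3).
shown_by_repr = repr(name)

# ascii() escapes every character outside the ASCII range.
shown_by_ascii = ascii(name)

print(shown_by_repr)
print(shown_by_ascii)
```

This only helps with inspecting the data; it does not convert the accented characters to their plain ASCII counterparts.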
