Error reading UTF-8 characters in Python

Question

I have the following function in Python. It takes a string as an argument and returns the same string in ASCII (e.g. "alçapão" -> "alcapao"):

def filt(word):
    dic = { u'á':'a',u'ã':'a',u'â':'a' } # the whole dictionary is too big, it is just a sample
    new = ''
    for l in word:
        new = new + dic.get(l, l)
    return new

It is supposed to "filter" every string in a list that I read from a file, like this:

lines = []
with open("to-filter.txt","r") as f:
    for line in f:
        lines.append(line.strip())

lines = [filt(l) for l in lines]

But I get this:

filt.py:9: UnicodeWarning: Unicode equal comparison failed to convert 
  both arguments to Unicode - interpreting them as being unequal 
  new = new + dic.get(l, l)

And the filtered strings contain sequences like '\xc3\xb4' instead of ASCII characters. What should I do?

Answer 1

You are mixing and matching Unicode strings and regular (byte) strings.
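To see why the dictionary lookup never matches, it can help to look at what a byte-mode read actually hands to filt. The following is a minimal sketch that runs the same way on Python 2 and Python 3; the sample word and the variable names are illustrative assumptions, not part of the original code.

```python
# -*- coding: utf-8 -*-
# Illustrative only: compare the byte string a Python 2 file read returns
# with the Unicode string that the dictionary keys expect.

text = u"al\u00e7ap\u00e3o"      # the Unicode string u"alçapão"
raw = text.encode("utf-8")       # its UTF-8 bytes, as stored in the file

# Each accented letter takes two bytes in UTF-8, so the lengths differ.
print(len(text))                 # number of characters
print(len(raw))                  # number of bytes

# u"ô" (U+00F4) is stored as the two bytes 0xC3 0xB4, which is the
# '\xc3\xb4' sequence that shows up in the unfiltered output.
o_circumflex = u"\u00f4".encode("utf-8")
print(repr(o_circumflex))

# Decoding turns the bytes back into the original Unicode string.
decoded = raw.decode("utf-8")
print(decoded == text)
```

Looping over the byte string in Python 2 yields one byte at a time, so no single item can ever equal a key such as u'ã', and the comparison triggers the UnicodeWarning.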

Use the io module, which opens a text file and decodes it to Unicode as it is read:

import io
with io.open("to-filter.txt","r", encoding="utf-8") as f:

This assumes that your to-filter.txt file is UTF-8 encoded.

You can also shorten the code that reads the file into a list to just:

with io.open("to-filter.txt","r", encoding="utf-8") as f:
    lines = f.read().splitlines()

Now, lines is a list of Unicode strings.
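As a quick check of the reading step, here is a hedged, self-contained sketch. It writes a small temporary file rather than relying on to-filter.txt, and the helper name read_lines and the sample words are assumptions made for the illustration.

```python
import io
import os
import tempfile


def read_lines(path):
    """Return the lines of a UTF-8 text file as Unicode strings."""
    with io.open(path, "r", encoding="utf-8") as f:
        return f.read().splitlines()


# Create a throwaway UTF-8 file so the example does not depend on
# any file that already exists on disk.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(u"al\u00e7ap\u00e3o\nav\u00f4\n")
    lines = read_lines(path)
finally:
    os.remove(path)

# Every item is a decoded Unicode string, with no trailing newline.
print(lines)
```

One design point to be aware of: splitlines() also splits on a few less common separators (for example U+2028), which iterating over the file line by line does not do.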

Optional

It looks like you are trying to convert non-ASCII characters to their closest ASCII equivalents. The easy way is:

import unicodedata
def filt(word):
    return unicodedata.normalize('NFKD', word).encode('ascii', errors='ignore').decode('ascii')

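What this does:

1. Decomposes each character into its component parts. For example, ã can be represented as a single Unicode character (U+00E3 'LATIN SMALL LETTER A WITH TILDE') or as two Unicode characters: U+0061 'LATIN SMALL LETTER A' + U+0303 'COMBINING TILDE'.
2. Encodes the component parts to ASCII. Non-ASCII parts (those with code points greater than U+007F) are ignored.
3. Decodes back to a Unicode str for convenience.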
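The three steps above can be traced one at a time. This is a small sketch using only the standard library; the intermediate variable names are assumptions chosen for readability, and the last lines show a limitation of the approach rather than anything claimed in the answer.

```python
import unicodedata

word = u"\u00e3"  # U+00E3, the single-character form of "ã"

# Step 1: decompose into a base letter plus a combining mark.
decomposed = unicodedata.normalize("NFKD", word)
print([hex(ord(ch)) for ch in decomposed])

# Step 2: encode to ASCII, silently dropping anything above U+007F
# (here, the combining tilde U+0303).
ascii_bytes = decomposed.encode("ascii", errors="ignore")

# Step 3: decode the remaining bytes back into a text string.
result = ascii_bytes.decode("ascii")
print(result)

# Limitation: a character with no decomposition, such as U+00F8
# ("o with stroke"), is dropped entirely instead of becoming "o".
stroke = unicodedata.normalize("NFKD", u"\u00f8")
stroke_result = stroke.encode("ascii", errors="ignore").decode("ascii")
print(repr(stroke_result))
```

If letters such as ø or ß matter for your data, a lookup table or a dedicated transliteration package may be a better fit than plain normalization.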

TL;DR

Your code is now:

import io
import unicodedata

def filt(word):
    # Decompose accented characters, drop the non-ASCII parts, return text
    return unicodedata.normalize('NFKD', word).encode('ascii', errors='ignore').decode('ascii')

with io.open("to-filter.txt","r", encoding="utf-8") as f:
    lines = f.read().splitlines()

lines = [filt(l) for l in lines]

Python 3.x

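Although it is not strictly required, drop the io. prefix and call the built-in open() directly; in Python 3 it accepts the encoding argument itself.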
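For Python 3, the whole pipeline might look like the sketch below. It is an illustration under assumptions: it uses a temporary file in place of to-filter.txt, and the sample words are made up for the demonstration.

```python
import os
import tempfile
import unicodedata


def filt(word):
    """Strip accents by decomposing and discarding non-ASCII parts."""
    decomposed = unicodedata.normalize("NFKD", word)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


# Temporary stand-in for to-filter.txt (Python 3 only: the built-in
# open() takes the encoding argument directly).
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with open(path, "w", encoding="utf-8") as f:
        f.write("al\u00e7ap\u00e3o\nav\u00f4\n")
    with open(path, "r", encoding="utf-8") as f:
        lines = f.read().splitlines()
finally:
    os.remove(path)

filtered = [filt(line) for line in lines]
print(filtered)
```

Passing encoding="utf-8" explicitly keeps the result independent of the platform's default encoding.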

Answer 2

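The root of the problem is that you are not reading Unicode strings from the file; you are reading byte strings. There are three ways to fix this. The first is to open the file with the io module, as the other answer suggests. The second is to decode each string as you read it: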

lines = []
with open("to-filter.txt","r") as f:
    for line in f:
        # Python 2: each line is a byte string, so decode it explicitly
        lines.append(line.decode('utf-8').strip())
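The decode-as-you-read approach can be sketched in a form that behaves the same on Python 2 and Python 3 by opening the file in binary mode, so that every line is a byte string on both versions. The helper name read_decoded_lines, the temporary file and the sample words are assumptions made for this example.

```python
import os
import tempfile


def read_decoded_lines(path, encoding="utf-8"):
    """Read a file as bytes and decode each line to Unicode explicitly."""
    lines = []
    with open(path, "rb") as f:          # binary mode: lines are bytes
        for line in f:
            lines.append(line.decode(encoding).strip())
    return lines


# Write raw UTF-8 bytes to a throwaway file, then read them back.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with open(path, "wb") as f:
        f.write(u"al\u00e7ap\u00e3o\nav\u00f4\n".encode("utf-8"))
    lines = read_decoded_lines(path)
finally:
    os.remove(path)

print(lines)
```

Decoding a whole line at once is safe here because a line break never falls in the middle of a multi-byte UTF-8 sequence.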

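The third way is to use Python 3, which always reads files opened in text mode as Unicode strings (passing encoding="utf-8" explicitly avoids depending on the platform's default encoding).

Finally, there is no need to write your own code to convert accented characters to plain ASCII; the unidecode package does exactly that.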

from unidecode import unidecode
print(unidecode(line))

Answer 1: score 3, 2017-02-17 16:16:04
Answer 2: score -1, 2017-02-17 17:05:58
