Reading a .json file and converting unicode data to utf-8

Question

I have never really understood how encoding and decoding work in Python, and I run into this type of problem frequently. I have to read a JSON file and compare some of its values with other data.

In one of the files I have the string BAIXA DA INSCRI\Ç\ÃO ESTADUAL, which should become BAIXA DA INSCRICAO ESTADUAL. I am reading the file like this:

import codecs
import json

with codecs.open(filepath, 'r') as file:
    filedata = json.loads(file.read())

However, the string is read as unicode and represented like u'BAIXA DA INSCRI\\xc7\\xc3O ESTADUAL'

How can I make this happen, and what is the proper way to work with codecs in Python?

Answer 1

It looks like you want to remove any diacritics from your text. You can try using Unicode normal form D (for decomposed), which splits each accented character into a base letter plus combining marks, and then filter out the code points above 127:

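import unicodedata
txt = u'BAIXA DA INSCRI\xc7\xc3O ESTADUAL'
# NFD splits each accented letter into base letter + combining mark; keep only the ASCII part
txt = u''.join(i for i in unicodedata.normalize('NFD', txt) if ord(i) < 128).encode('ASCII')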

It should give the (byte) string:

'BAIXA DA INSCRICAO ESTADUAL'
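For the second part of the question (working with encodings when reading the file), here is a minimal sketch for Python 3. It assumes the file is UTF-8 encoded, which the question does not state. The helper names `load_json` and `strip_diacritics` are hypothetical, and the filter is a slight variant of the one above: it drops only combining marks (Unicode category `Mn`) instead of every non-ASCII character.

```python
import json
import unicodedata


def strip_diacritics(text):
    """Return text with accents removed, e.g. 'Ç' becomes 'C'."""
    decomposed = unicodedata.normalize('NFD', text)
    # After NFD, accents are separate combining marks (category 'Mn'); drop them.
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')


def load_json(filepath, encoding='utf-8'):
    """Decode the file with an explicit encoding (assumed UTF-8), then parse it."""
    with open(filepath, 'r', encoding=encoding) as f:
        return json.load(f)
```

On Python 2, `io.open` accepts the same `encoding` argument. Decoding explicitly when the file is opened means every string that comes out of `json.load` is already text, so `strip_diacritics` can be applied directly to the values before comparing them.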
