Reading .json file and converting unicode data to utf-8

I never really understood how encoding and decoding work in Python, and I run into this type of problem frequently. I have to read a JSON file and compare some of its values with other data.

In one of the files I have the string BAIXA DA INSCRI\Ç\ÃO ESTADUAL, which should become BAIXA DA INSCRICAO ESTADUAL. I am reading the file like this:

import codecs
import json
with codecs.open(filepath, 'r') as file:
    filedata = json.loads(file.read())

However, the string is read as unicode and represented as u'BAIXA DA INSCRI\\xc7\\xc3O ESTADUAL'.

How can I make this happen, and what is the proper way to work with codecs in Python?

It looks like you want to remove any diacritics from your text. You can try Unicode normalization form D (for decomposed), which splits each accented letter into a base letter followed by combining marks, and then filter out every code point of 128 or above:

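import unicodedata
txt = u'BAIXA DA INSCRI\xc7\xc3O ESTADUAL'
# Decompose accented letters (NFD), keep only the ASCII code points, then encode to bytes
txt = u''.join(i for i in unicodedata.normalize('NFD', txt) if ord(i) < 128).encode('ASCII')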

It should give the (byte) string:

'BAIXA DA INSCRICAO ESTADUAL'
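
For the reading side of the question, here is a minimal sketch, assuming Python 3 and a file that is encoded as UTF-8. The sample file, its status key, and the helper names load_json and strip_diacritics are hypothetical and exist only for illustration. Note that this variant drops combining marks via unicodedata.combining instead of filtering on ord(i) < 128, so characters that have no decomposition are kept rather than silently removed; it also returns text rather than bytes.

```python
import io
import json
import os
import tempfile
import unicodedata


def strip_diacritics(text):
    """Return text with combining marks removed (for example 'Ç' becomes 'C')."""
    decomposed = unicodedata.normalize('NFD', text)
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))


def load_json(path, encoding='utf-8'):
    """Decode the file with an explicit encoding, then parse it as JSON."""
    with io.open(path, 'r', encoding=encoding) as handle:
        return json.load(handle)


# Hypothetical sample file, written only so that the sketch is runnable.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.json')
with io.open(sample_path, 'w', encoding='utf-8') as handle:
    handle.write(u'{"status": "BAIXA DA INSCRI\u00c7\u00c3O ESTADUAL"}')

filedata = load_json(sample_path)
cleaned = strip_diacritics(filedata['status'])
print(cleaned)
```

Passing the encoding when opening the file means the decoding step happens once, at the boundary, and everything after it works with text. If the file uses a different encoding (latin-1, for instance), only the encoding argument would need to change.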
