
ASCII decode error even though everything is unicode (Python 2.7)

I am running a script in Dataflow (Apache Beam). It runs on Python 2.7.12 and does some text processing with unicode strings.

As part of the processing I do the following, where noun and phrase are unicode (I think...):

# -*- coding: utf-8 -*-
...
key = u"{}_{}".format(
    noun, phrase.replace(u" ", u"_")
)

However, it yields ASCII decode errors:

'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)

I can add debugging to get a repr of the strings used as noun and phrase, but I don't have them at the moment because my logging didn't output them.

I don't understand the ASCII decode error when I think I am being pretty specific that I want everything in unicode!

Can you give some hints, or should I come back with more info about the input strings?

OK, so you have a non-ASCII character in your string, and phrase is apparently a byte string rather than unicode. You need to convert phrase into unicode explicitly

 phrase.decode('latin-1')

before passing it to unicode.format.
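In Python 2, mixing a unicode string with a byte string makes the interpreter decode the bytes implicitly as ASCII, which is where the error comes from. The following is a minimal sketch of decoding explicitly instead. It assumes the byte strings are UTF-8 encoded (0xe2 commonly starts a multi-byte UTF-8 sequence); substitute 'latin-1' if that is what the input really uses. The helper name to_text and the sample values are hypothetical, and the code is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-

def to_text(value, encoding="utf-8"):
    """Return value as a unicode string, decoding byte strings explicitly.

    Hypothetical helper; the encoding is an assumption about the input.
    """
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value


# Sample inputs (made up): a byte string with 0xe2 at position 1.
noun = u"cafe"
phrase = b"a\xe2\x80\x99s great"  # UTF-8 bytes for u"a\u2019s great"

# An ASCII decode of these bytes fails in the way the question reports.
try:
    phrase.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)

# Decoding explicitly first keeps everything unicode from here on.
key = u"{}_{}".format(to_text(noun), to_text(phrase).replace(u" ", u"_"))
```

Decoding once at the point where the data enters the pipeline, rather than at each use, is usually the simpler design, since every later step can then assume unicode.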

A colleague reminded me that I could always just encode the whole output, in this case the key, to whatever encoding I chose:

key = u"{}_{}_{}_{}".format(
    business_unit_id, date, noun, phrase.replace(u" ", u"_")
).encode('ascii', 'ignore')

This was for the case where I wanted ASCII output and did not care about losing characters like 💩.

I could also use ...).encode('utf-8') if I wanted UTF-8 encoded output instead.
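To illustrate the difference between the two encodings, here is a small sketch with a made-up key. The sample value is an assumption for illustration, not data from the pipeline.

```python
# -*- coding: utf-8 -*-

# Made-up key containing one non-ASCII character (U+2019, right single quote).
key = u"42_2019-01-01_cafe_a\u2019s_great"

# 'ignore' silently drops anything that cannot be represented in ASCII.
ascii_key = key.encode("ascii", "ignore")

# UTF-8 keeps every character, using three bytes for U+2019.
utf8_key = key.encode("utf-8")
```

Note that with 'ignore' two different phrases can collapse to the same key once their non-ASCII characters are dropped, which may or may not matter for grouping.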

In my case I settled on ASCII output, as Apache Beam did not seem happy with unicode keys in its map-reduce pipelines.


 