Encode/decode using Python
I have this function in Python:
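    # -*- coding: utf-8 -*-
    # Python 2: the source file is saved as UTF-8
    Str = "Ã¼"
    print Str

    def correctText(str):
        str = str.upper()
        # Read the bytes as UTF-8, then write them back as Windows-1252
        correctedText = str.decode('UTF8').encode('Windows-1252')
        return correctedText

    corText = correctText(Str)
    print corText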
It works and converts characters like Ã¼ and Ã©; however, it fails when I try Ã? and Â¶.
Is there a way I can fix it?
According to UTF-8, Ã and Â¶ are not valid characters, meaning their bytes do not form a well-formed UTF-8 sequence. What you need to do is either use some other kind of encoding or strip out the errors in your str by using the unicode() function. I recommend the latter.
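A minimal sketch of the second option, written for Python 3, where `bytes.decode` with an `errors` argument plays the role that `unicode(s, 'utf-8', errors)` plays in Python 2. The helper name `decode_lenient` and the sample bytes are assumptions chosen for illustration.

```python
def decode_lenient(raw, errors="replace"):
    """Decode bytes as UTF-8, handling invalid sequences according to `errors`."""
    return raw.decode("utf-8", errors=errors)


# C3 3F: a UTF-8 lead byte followed by a plain "?", so not valid UTF-8.
raw = b"\xc3\x3f"

print(repr(decode_lenient(raw, "replace")))  # invalid byte becomes U+FFFD
print(repr(decode_lenient(raw, "ignore")))   # invalid byte is dropped
```

Note that both handlers discard information: the text no longer raises an error, but the damaged character is not recovered.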
What you are trying to do is to compose valid UTF-8 codes from several consecutive Windows-1252 codes. For example, for Ã¼, the Windows-1252 code of Ã is C3 and the code of ¼ is BC. Together, the code C3BC happens to be the UTF-8 code of ü.
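The byte arithmetic above can be checked with a short Python 3 sketch. The variable names are illustrative, and the string is written with escapes so the example does not depend on the file's own encoding.

```python
mojibake = "\u00c3\u00bc"  # the two characters "Ã¼"

# Each character maps to one Windows-1252 byte: Ã -> C3, ¼ -> BC.
raw = mojibake.encode("cp1252")
print(raw.hex())  # c3bc

# Read as UTF-8, those two bytes form the single character ü (U+00FC).
fixed = raw.decode("utf-8")
print(fixed)
```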
Now, for Ã?, the Windows-1252 code is C33F, which is not a valid UTF-8 code (because the second byte does not start with the bits 10).
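To make the continuation-byte rule concrete, here is a small Python 3 sketch. The helper `is_continuation` is a hypothetical name introduced for this example. It tests the top two bits of the second byte and then shows that the decoder rejects the sequence.

```python
def is_continuation(byte):
    """Return True if the byte has the bit pattern 10xxxxxx."""
    return (byte & 0b11000000) == 0b10000000


raw = "\u00c3?".encode("cp1252")  # "Ã?" -> bytes C3 3F
print(raw.hex())                  # c33f
print(format(raw[1], "08b"))      # 00111111, does not start with 10
print(is_continuation(raw[1]))    # False

try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    print("not valid UTF-8:", exc.reason)
```

For comparison, the byte BC from the previous example is 10111100 in binary, which does pass the check.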
Are you sure this sequence occurs in your text? For example, for à, the Windows-1252 decoding of the UTF-8 code (C3A0) is Ã followed by a non-printable character (non-breaking space). So, if this second character is not printed, the ? might be a regular character of the text.
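The à case can be reproduced in Python 3 as a quick check. This sketch only inspects code points, since the non-breaking space would be invisible if printed directly.

```python
raw = "\u00e0".encode("utf-8")  # "à" -> bytes C3 A0
seen = raw.decode("cp1252")     # what a Windows-1252 reader displays

print(raw.hex())                      # c3a0
print([hex(ord(ch)) for ch in seen])  # ['0xc3', '0xa0']: Ã, then NBSP

# Reversing the two steps recovers the original character.
restored = seen.encode("cp1252").decode("utf-8")
print(restored == "\u00e0")
```

If the trailing U+00A0 is lost or replaced along the way, the remaining Ã can no longer be repaired on its own.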
For Â¶, the Windows-1252 encoding is C2B6. Shouldn't it be Ã¶, for which the Windows-1252 encoding is C3B6, which equals the UTF-8 code of ö?
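The two candidates can be compared side by side in Python 3. The `repair` helper below is a hypothetical name for the encode-then-decode round trip discussed in this answer, not part of any library.

```python
def repair(text):
    """Undo one round of UTF-8 bytes having been read as Windows-1252."""
    return text.encode("cp1252").decode("utf-8")


pilcrow_form = "\u00c2\u00b6"  # "Â¶"
umlaut_form = "\u00c3\u00b6"   # "Ã¶"

print(pilcrow_form.encode("cp1252").hex())  # c2b6
print(umlaut_form.encode("cp1252").hex())   # c3b6

print(repair(pilcrow_form))  # ¶ (U+00B6)
print(repair(umlaut_form))   # ö (U+00F6)
```

Both byte pairs are well-formed UTF-8, so the decoder cannot tell which one was intended; only the surrounding text can.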