I have this function in Python:
Str = "ü"
print Str
def correctText(str):
    str = str.upper()
    correctedText = str.decode('UTF8').encode('Windows-1252')
    return correctedText
corText = correctText(Str)
print corText
It works and converts characters like ü and é. However, it fails when I try � and ¶.
Is there a way I can fix it?
à and ¶ are valid characters in themselves, but the bytes behind them may not be valid UTF-8: a lead byte such as C3 must be followed by a continuation byte in the range 80 to BF, and decoding fails otherwise. You can either use a different encoding or handle the errors in your str when decoding, by passing an error handler such as 'ignore' or 'replace' to the unicode() function. I recommend the latter.
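As a minimal sketch of lenient decoding, the snippet below uses Python 3, where byte strings and text are separate types; in Python 2 the roughly equivalent call would be unicode(raw, 'utf-8', 'replace'). The helper name decode_lenient and the sample bytes are made up for illustration.

```python
# Sample input: a valid UTF-8 "é" (C3 A9) followed by a broken sequence,
# namely the lead byte C3 followed by a plain "?" (3F).
raw = b"caf\xc3\xa9 \xc3?"


def decode_lenient(data, errors="replace"):
    """Decode UTF-8 bytes without raising on undecodable bytes.

    errors="replace" substitutes U+FFFD for each bad byte;
    errors="ignore" silently drops it.
    """
    return data.decode("utf-8", errors=errors)


replaced = decode_lenient(raw)            # bad byte becomes U+FFFD
ignored = decode_lenient(raw, "ignore")   # bad byte is dropped

print(repr(replaced))
print(repr(ignored))
```

Note that both handlers lose information: the stray C3 byte is gone afterwards, so this hides the problem instead of recovering the intended character.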
What you are trying to do is compose valid UTF-8 codes from several consecutive Windows-1252 codes.
For example, for ü: the Windows-1252 code of Ã is C3 and that of ¼ is BC. Together, the bytes C3BC happen to be the UTF-8 code of ü.
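The byte arithmetic for Ã and ¼ can be checked directly. This is a small Python 3 sketch, not the asker's code: the function name repair_mojibake is hypothetical, and "cp1252" is Python's codec name for Windows-1252.

```python
def repair_mojibake(text):
    """Re-encode text as Windows-1252 and read the bytes back as UTF-8.

    This only succeeds when the resulting bytes form valid UTF-8;
    otherwise a UnicodeDecodeError (or UnicodeEncodeError) is raised.
    """
    return text.encode("cp1252").decode("utf-8")


mojibake = "\u00c3\u00bc"          # the two characters Ã and ¼
raw = mojibake.encode("cp1252")    # their Windows-1252 bytes: C3 BC
fixed = repair_mojibake(mojibake)  # the same bytes read as UTF-8

print(raw.hex().upper())
print(fixed)
```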
Now, for Ã?, the Windows-1252 code is C33F, which is not a valid UTF-8 code (because the second byte does not start with the bits 10).
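The rule about the second byte can be illustrated with a short Python 3 sketch. The helper is_continuation is a hypothetical name; it only tests the top two bits and is not a full UTF-8 validator.

```python
def is_continuation(byte):
    """Return True if the byte has the bit pattern 10xxxxxx."""
    return (byte & 0b11000000) == 0b10000000


good_second = is_continuation(0xBC)  # second byte of C3BC
bad_second = is_continuation(0x3F)   # second byte of C33F, a plain "?"

# Python's own decoder rejects C3 3F for the same reason.
try:
    b"\xc3\x3f".decode("utf-8")
    decodes = True
except UnicodeDecodeError:
    decodes = False

print(good_second, bad_second, decodes)
```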
Are you sure this sequence occurs in your text? For example, for à, the Windows-1252 decoding of the UTF-8 code (C3A0) is Ã followed by a non-printable character (a non-breaking space). So, if this second character is not printed, the ? might be a regular character of the text.
For ¶, the UTF-8 code is C2B6, which reads as Â¶ in Windows-1252. Shouldn't it be ö, whose UTF-8 code C3B6 reads as Ã¶ in Windows-1252?
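To compare these cases side by side, the following Python 3 sketch prints the UTF-8 bytes of a character and how those bytes look when misread as Windows-1252. The function name misread_as_cp1252 is made up for illustration.

```python
def misread_as_cp1252(ch):
    """Return (UTF-8 bytes as hex, the same bytes decoded as Windows-1252)."""
    utf8 = ch.encode("utf-8")
    return utf8.hex().upper(), utf8.decode("cp1252")


# à, ¶ and ö, written as escapes so the source file encoding does not matter.
for ch in ("\u00e0", "\u00b6", "\u00f6"):
    hex_code, misread = misread_as_cp1252(ch)
    # repr() makes the non-breaking space (\xa0) visible.
    print(ch, hex_code, repr(misread))
```

The codes for ¶ and ö differ only in their first byte (C2 versus C3), which is why a single wrong character in the garbled text changes the result.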