
Is this the best way to ensure that a Python unicode “string” is encoded in UTF-8?

Given an arbitrary "string" from a library I do not have control over, I want to make sure the "string" is a unicode type and encoded in UTF-8. I would like to know whether this is the best way to do it:

import types

input = <some value from a lib I dont have control over>

if isinstance(input, types.StringType):
    input = input.decode("utf-8")
elif isinstance(input, types.UnicodeType):
    input = input.encode("utf-8").decode("utf-8")

In my actual code I wrap this in a try/except and handle the errors, but I left that part out.

A Unicode object is not encoded (it has an internal representation, but that should be transparent to you as a Python user). The line input.encode("utf-8").decode("utf-8") does not make much sense: you end up with exactly the same sequence of Unicode characters you started with.

if isinstance(input, str):
    input = input.decode('utf-8')

is all you need to ensure that str objects (byte strings) are converted into Unicode strings.
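For what it's worth, that check can be wrapped in a small helper. This is only a sketch: the name to_text is made up for illustration, it assumes the byte strings really are UTF-8, and it is written so that it also runs on Python 3 (where bytes and str play the roles of Python 2's str and unicode).

```python
# Hypothetical helper; "to_text" is not from the question or from any library.
text_type = type(u"")  # unicode on Python 2, str on Python 3


def to_text(value, encoding="utf-8"):
    """Return value as a Unicode string, decoding byte strings if needed."""
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return value.decode(encoding)  # raises UnicodeDecodeError on bad input
    if isinstance(value, text_type):
        return value  # already Unicode: there is nothing to decode
    raise TypeError("expected a byte or Unicode string, got %r" % type(value))


print(repr(to_text(b"caf\xc3\xa9")))  # UTF-8 bytes are decoded
print(repr(to_text(u"caf\xe9")))      # Unicode input passes through unchanged
```

Raising TypeError for anything else is a design choice of this sketch; silently passing other types through would hide bugs in the calling code.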

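Simply:

try:
    # unicode() with an explicit encoding decodes a byte string (str)
    input = unicode(input, 'utf-8')
except TypeError:
    # input was already a unicode object, so there is nothing to decode
    pass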

It's always better to seek forgiveness than to ask permission.

I think you have a misunderstanding of Unicode and encodings. Unicode characters are just numbers. Encodings are the representation of the numbers. Think of Unicode characters as a concept like fifteen, and encodings as 15, 1111, F, XV. You have to know the encoding (decimal, binary, hexadecimal, roman numerals) before you can decode an encoding and "know" the Unicode value.
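To make the analogy concrete, here is a small sketch using one character, U+00E9 (e with an acute accent), chosen purely as an example. The same code point comes out as different bytes under different encodings, and decoding with the wrong encoding does not fail, it just gives you different characters.

```python
# One abstract character: code point U+00E9 (e with an acute accent).
char = u"\u00e9"

# The same code point has a different byte representation per encoding.
as_utf8 = char.encode("utf-8")       # two bytes: C3 A9
as_latin1 = char.encode("latin-1")   # one byte:  E9
as_utf16 = char.encode("utf-16-le")  # two bytes: E9 00

# Decoding with the matching encoding always recovers the same code point.
assert as_utf8.decode("utf-8") == char
assert as_latin1.decode("latin-1") == char
assert as_utf16.decode("utf-16-le") == char

# Decoding with the wrong encoding "works" but yields two other characters
# (U+00C3 and U+00A9) instead of the original one.
wrong = as_utf8.decode("latin-1")
print(repr(as_utf8), repr(as_latin1), repr(as_utf16), repr(wrong))
```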

If you have no control over the input string, it is difficult to convert it to anything. For example, if the input was read from a file you'd have to know the encoding of the text file to decode it meaningfully to Unicode, and then encode it into 'UTF-8' for your C++ library.
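As a rough sketch of that decode-then-encode step, assuming you have somehow learned the source encoding (the function name transcode_to_utf8 and the latin-1 sample are made up for illustration):

```python
def transcode_to_utf8(raw, source_encoding):
    """Re-encode a byte string from a known source encoding to UTF-8."""
    text = raw.decode(source_encoding)  # bytes -> Unicode, needs the right encoding
    return text.encode("utf-8")         # Unicode -> UTF-8 bytes


# Assumption for this example: the bytes are known to be latin-1.
latin1_bytes = b"caf\xe9"
utf8_bytes = transcode_to_utf8(latin1_bytes, "latin-1")
print(repr(utf8_bytes))
```

If source_encoding is wrong, the call may still succeed and quietly produce garbage, which is why knowing the encoding up front matters.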

Are you sure you want a UTF-8 encoded sequence stored in a Unicode type? Normally, Python stores the characters of a types.UnicodeType internally as UCS-2 or UCS-4, sometimes referred to as "wide" characters, which should be capable of holding characters from all reasonably common scripts.
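If you are curious which of the two your interpreter uses, sys.maxunicode tells you. A minimal check (note that on Python 3.3 and later the value is always 0x10FFFF, so the distinction only matters on older builds):

```python
import sys

# 0xFFFF on a "narrow" (UCS-2) build, 0x10FFFF on a "wide" (UCS-4) build.
is_wide_build = sys.maxunicode > 0xFFFF
print(hex(sys.maxunicode), "wide" if is_wide_build else "narrow")
```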

One wonders what sort of lib this is that sometimes outputs types.StringType and sometimes types.UnicodeType. If I were to take a wild guess, the lib always produces types.StringType but doesn't tell you which encoding it is in. If that is the case, you are actually looking for code that can guess which charset a types.StringType is encoded in.

In most cases this is easy, as you can assume that it is in either, e.g., latin-1 or UTF-8. If the text can actually be in any odd encoding (e.g. incoming mail without a proper header), you need a lib that guesses the encoding. See http://chardet.feedparser.org/ .
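For the easy case, a sketch of the "either UTF-8 or latin-1" assumption could look like the following. The function name decode_with_fallback is made up, and the order of the candidates matters: latin-1 accepts any byte sequence, so it has to come last or UTF-8 would never be tried.

```python
def decode_with_fallback(raw, encodings=("utf-8", "latin-1")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next candidate
    raise ValueError("none of the candidate encodings matched")


print(decode_with_fallback(b"caf\xc3\xa9"))  # valid UTF-8
print(decode_with_fallback(b"caf\xe9"))      # invalid UTF-8, falls back to latin-1
```

This is a heuristic, not detection: bytes that happen to be valid UTF-8 are always reported as UTF-8, even if they were written in some other encoding.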
