How to compare unicode and string in Python?

I have two variables (let's say x and y) that have the following values:

x = u'Ko\u0161ick\xfd'
y = 'Ko\x9aick\xfd'

They presumably encode the same name, but in different ways. The first variable is unicode and the second one is a string.

Is there a way to transform the string into unicode (or the unicode into a string) and check whether they are really the same?

I tried to use encode:

x.encode('utf-8')

It returns something new (a third version):

'Ko\xc5\xa1ick\xc3\xbd'

And using the following:

print x.encode('utf-8')

prints yet another version:

KošickÛ

So, I am totally confused. Is there a way to keep everything in the same format?

You can convert a byte string to Unicode, but if it contains any non-ASCII characters, you have to specify the encoding.

if y.decode('iso-8859-1') == x:
    print(u'{0!r} converted to Unicode == {1}'.format(y, x))

With your given example, this is not true; but perhaps y is in a different encoding.
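To see why the comparison fails, it can help to decode the same bytes with two codecs and compare the resulting code points. The following is only a sketch: windows-1252 is tried here as one plausible candidate, not as a known fact about where y came from, and the names as_latin1 and as_cp1252 are made up for illustration. It uses explicit u'' and b'' literals so that it should behave the same on Python 2.7 and Python 3.

```python
x = u'Ko\u0161ick\xfd'
y = b'Ko\x9aick\xfd'

as_latin1 = y.decode('iso-8859-1')
as_cp1252 = y.decode('windows-1252')   # assumption: one candidate encoding

# ISO-8859-1 maps byte 0x9a to the control character U+009A,
# while windows-1252 maps it to U+0161 (s with caron).
print(repr(as_latin1[2]))
print(repr(as_cp1252[2]))

print(as_latin1 == x)   # the ISO-8859-1 interpretation does not match x
print(as_cp1252 == x)   # the windows-1252 interpretation does
```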

In theory, you could convert either way, but generally it makes sense to use Unicode everywhere internally, and to convert other encodings to Unicode for use in your code (not the other way around).

You need to know the encoding of the byte string. It looks like windows-1252:

x = u'Ko\u0161ick\xfd'
y = 'Ko\x9aick\xfd'

print x == y.decode('windows-1252')
print x.encode('windows-1252') == y

Output:

True
True

Best practice is to convert text to Unicode on input to the program, do all the processing in Unicode, and convert back to encoded bytes only to persist to storage, transmit on a socket, etc.
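A minimal sketch of that decode, process, encode pattern might look like the following. The function normalize_name and the choice of windows-1252 for input and UTF-8 for output are assumptions made for the example; in a real program the encodings come from whatever produces and consumes the data.

```python
def normalize_name(text):
    # Hypothetical processing step: operates on Unicode text, never on raw bytes.
    return text.strip().upper()

# Bytes as they arrive from outside the program (assumed to be windows-1252).
raw_input_bytes = b'  Ko\x9aick\xfd \n'

text = raw_input_bytes.decode('windows-1252')   # decode once, at the input boundary
result = normalize_name(text)                   # all processing happens in Unicode
encoded = result.encode('utf-8')                # encode once, at the output boundary

print(repr(result))
print(repr(encoded))
```

Keeping the decode and encode calls at the edges means the rest of the code never has to guess which encoding a given byte string is in.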

Well, utf-8 is now the de facto standard for interchange and in the Linux world, but there are plenty of other encodings.

Common examples are latin1, latin9 (nearly the same, but with the € symbol), and cp1252, a Windows variant of them.

In your case:
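As a rough illustration of how these three differ, the sketch below encodes the euro sign with each codec and decodes one byte value with each. It is written for Python 3 and uses Python's codec names latin1, iso8859_15 (latin9) and cp1252; the variable names are only for the example.

```python
euro = u'\u20ac'  # EURO SIGN
codecs_to_try = ('latin1', 'iso8859_15', 'cp1252')

# latin1 has no euro sign at all; the other two place it at different bytes.
for codec in codecs_to_try:
    try:
        print(codec, repr(euro.encode(codec)))
    except UnicodeEncodeError:
        print(codec, 'cannot encode the euro sign')

# The same byte can decode to different characters depending on the codec.
byte = b'\xa4'
decoded = dict((codec, byte.decode(codec)) for codec in codecs_to_try)
for codec in codecs_to_try:
    print(codec, repr(decoded[codec]))
```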

>>> x.encode('cp1252')
'Ko\x9aick\xfd'

So the y string seems to be cp1252 encoded.
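If the encoding is not known, one rough way to narrow it down is to decode the bytes with a handful of candidate codecs and see which ones reproduce the known Unicode string. The helper matching_encodings and the candidate list below are assumptions for the sketch, not a standard API. Note that more than one candidate can match, because code pages such as cp1250 and cp1252 agree on many byte values, so a match is a hint rather than proof.

```python
def matching_encodings(text, raw, candidates):
    """Return the candidate codecs under which `raw` decodes to `text`."""
    matches = []
    for codec in candidates:
        try:
            if raw.decode(codec) == text:
                matches.append(codec)
        except UnicodeDecodeError:
            # These bytes are not valid in this codec at all.
            pass
    return matches

x = u'Ko\u0161ick\xfd'
y = b'Ko\x9aick\xfd'

# Hypothetical shortlist; extend it with whatever encodings are plausible.
candidates = ['utf-8', 'latin1', 'iso8859_15', 'cp1250', 'cp1252']
found = matching_encodings(x, y, candidates)
print(found)
```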
