How to compare unicode and str in Python

My code:

a = '汉'
b = u'汉'

These two are the same Chinese character, but a == b is False. How do I fix this? Note that I can't convert a to utf-8 because I have no access to the code. I need to convert b to the encoding that a is using.

So, my question is: how do I convert b to the same encoding as a?

If you don't know a's encoding, you'll need to:

  1. detect a's encoding
  2. encode b using the detected encoding

First, to detect a's encoding, let's use chardet.

$ pip install chardet

Now let's use it:

>>> import chardet
>>> a = '汉'
>>> chardet.detect(a)
{'confidence': 0.505, 'encoding': 'utf-8'}

So, to actually accomplish what you requested:

>>> encoding = chardet.detect(a)['encoding']
>>> b = u'汉'
>>> b_encoded = b.encode(encoding)
>>> a == b_encoded
True
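Detection is only a guess (note the 0.505 confidence above). Since the text of b is already known here, another option is to try a short list of candidate encodings and keep the one whose bytes match a exactly. The following is a minimal sketch of that idea, not part of the answer above. It is written for Python 3, where the byte string is bytes and the text is str (the counterparts of Python 2's str and unicode). The function name find_matching_encoding and the candidate list are hypothetical, chosen for illustration.

```python
def find_matching_encoding(raw, text,
                           candidates=("utf-8", "gbk", "big5", "utf-16-le")):
    """Return the first candidate under which `text` encodes to `raw`, else None."""
    for name in candidates:
        try:
            if text.encode(name) == raw:
                return name
        except UnicodeEncodeError:
            # This codec cannot represent the text at all; try the next one.
            continue
    return None


b = "\u6c49"            # the character 汉 as text
a = b.encode("gbk")     # stand-in for a byte string of unknown encoding

encoding = find_matching_encoding(a, b)
if encoding is not None:
    assert b.encode(encoding) == a
```

This only works when the two values are already known to hold the same text. For arbitrary input, a detector such as chardet is still needed.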

Decode the encoded string a using str.decode:

>>> a = '汉'
>>> b = u'汉'
>>> a.decode('utf-8') == b
True

NOTE: Replace utf-8 with whatever encoding the source file actually uses.
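Decoding with the wrong encoding either raises an error or produces different text, so a comparison helper may want to treat a failed decode as "not equal". Here is a minimal sketch, assuming Python 3 (bytes for the encoded value, str for the text). The helper name same_text is hypothetical.

```python
def same_text(raw, text, encoding="utf-8"):
    """Compare a byte string with text by decoding the bytes first."""
    try:
        return raw.decode(encoding) == text
    except UnicodeDecodeError:
        # The bytes are not valid in this encoding, so they cannot match.
        return False


text = "\u6c49"                  # the character 汉
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

print(same_text(utf8_bytes, text))         # decoded as utf-8
print(same_text(gbk_bytes, text))          # wrong encoding assumed
print(same_text(gbk_bytes, text, "gbk"))   # correct encoding supplied
```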

Both a.decode and b.encode work:

In [133]: a.decode('utf') == b
Out[133]: True

In [134]: b.encode('utf') == a
Out[134]: True

Note that str.encode and unicode.decode also exist, so don't mix them up. In Python 2, calling a.encode(...) on a byte string first decodes it implicitly as ASCII, which raises UnicodeDecodeError for non-ASCII bytes such as these. See "What is the difference between encode/decode?"
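For comparison, Python 3 removes the wrong-direction methods entirely: bytes only has decode and str only has encode. The sketch below, assuming Python 3, shows the two directions and a small normalizing helper. The name to_text is hypothetical.

```python
def to_text(value, encoding="utf-8"):
    """Return text for either bytes or str input (bytes are decoded)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


text = "\u6c49"                 # the character 汉 as text
raw = text.encode("utf-8")      # text -> bytes

assert raw.decode("utf-8") == text      # bytes -> text
assert not hasattr(raw, "encode")       # bytes cannot be encoded again
assert not hasattr(text, "decode")      # text cannot be decoded
assert to_text(raw) == to_text(text)    # compare after normalizing
```

Normalizing both sides to text before comparing avoids having to remember which value is encoded.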
