How do I check whether a string is Unicode or ASCII?

Question

What do I have to do in Python to figure out which encoding a string has?

Answer 1

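In Python 3, all strings are sequences of Unicode characters. There is a bytes type that holds raw bytes.

In Python 2, a string may be of type str or of type unicode. You can tell which with code like this: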

def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

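This does not distinguish "Unicode or ASCII"; it only distinguishes Python types. A Unicode string may consist purely of characters in the ASCII range, and a byte string may contain ASCII, encoded Unicode, or even non-textual data.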
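
To make the difference between type and content concrete, here is a minimal Python 3 sketch. The helper name describe is made up for illustration and is not part of the answer, and the code assumes Python 3.7 or later, where str, bytes and bytearray all provide isascii().

```python
def describe(value):
    """Report the Python type of value and whether its content is ASCII-only."""
    if isinstance(value, str):
        kind = "text (str)"
    elif isinstance(value, (bytes, bytearray)):
        kind = "bytes"
    else:
        return "not a string"
    # isascii() is available on str, bytes and bytearray since Python 3.7
    content = "ASCII-only" if value.isascii() else "contains non-ASCII"
    return "{}, {}".format(kind, content)


print(describe("abc"))          # text (str), ASCII-only
print(describe("\u00dc"))       # text (str), contains non-ASCII
print(describe(b"\xc3\x9c"))    # bytes, contains non-ASCII
print(describe(42))             # not a string
```

The type check and the content check are independent: both "abc" and b"abc" are ASCII-only, yet only the first is text.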

Answer 2

How to tell if an object is a unicode string or a byte string

You can use type or isinstance.

In Python 2:

>>> type(u'abc')  # Python 2 unicode string literal
<type 'unicode'>
>>> type('abc')   # Python 2 byte string literal
<type 'str'>

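In Python 2, str is just a sequence of bytes. Python doesn't know what its encoding is. The unicode type is the safer way to store text. If you want to understand this in more depth, I recommend http://farmdev.com/talks/unicode/.

In Python 3: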

>>> type('abc')   # Python 3 unicode string literal
<class 'str'>
>>> type(b'abc')  # Python 3 byte string literal
<class 'bytes'>

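In Python 3, str is like Python 2's unicode, and is used to store text. What was called str in Python 2 is called bytes in Python 3.

How to tell if a byte string is valid utf-8 or ascii

You can call decode. If it raises a UnicodeDecodeError exception, the bytes are not valid in that encoding.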

>>> u_umlaut = b'\xc3\x9c'   # UTF-8 representation of the letter 'Ü'
>>> u_umlaut.decode('utf-8')
u'\xdc'
>>> u_umlaut.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
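
The decode check shown above can be wrapped in a small helper. This is only a sketch for Python 3, and the function name can_decode is invented here for illustration. A successful decode shows that the bytes are valid in that encoding, not that it is the encoding the data was written in.

```python
def can_decode(data, encoding):
    """Return True if the bytes in data decode cleanly with encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


u_umlaut = b"\xc3\x9c"                   # UTF-8 bytes for the letter 'Ü'
print(can_decode(u_umlaut, "utf-8"))     # True
print(can_decode(u_umlaut, "ascii"))     # False
print(can_decode(u_umlaut, "latin-1"))   # True: latin-1 accepts every byte
```

The last line is why a passing check is weak evidence: latin-1 decodes the same two bytes without error, just into different characters.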

Answer 3

In Python 3.x all strings are sequences of Unicode characters, so an isinstance check against str (which means a unicode string by default) is enough.

isinstance(x, str)

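For Python 2.x, most people seem to use an if statement with two checks: one for str and one for unicode.

If you want to check whether you have a "string-like" object with a single statement, you can do the following: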

isinstance(x, basestring)

Answer 4

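Unicode is not an encoding. To quote Kumar McMillan:

If ASCII, UTF-8, and other byte strings are "text"...

...then Unicode is "text-ness";

it is the abstract form of text

Have a read of McMillan's "Unicode In Python, Completely Demystified" talk from PyCon 2008; it explains things a lot better than most of the related answers on Stack Overflow.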

Answer 5

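If your code needs to be compatible with both Python 2 and Python 3, you can't directly use things like isinstance(s,bytes) or isinstance(s,unicode) without wrapping them in either try/except or a Python version test, because unicode is undefined in Python 3, and bytes is undefined in Python 2 before 2.6 (from 2.6 on it is only an alias for str).

There are some ugly workarounds. An extremely ugly one is to compare the name of the type instead of comparing the type itself. Here's an example: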

# convert bytes (python 3) or unicode (python 2) to str
if str(type(s)) == "<class 'bytes'>":
    # only possible in Python 3
    s = s.decode('ascii')  # or  s = str(s)[2:-1]
elif str(type(s)) == "<type 'unicode'>":
    # only possible in Python 2
    s = str(s)

An arguably slightly less ugly workaround is to check the Python version number, e.g.:

import sys

if sys.version_info >= (3,0,0):
    # for Python 3
    if isinstance(s, bytes):
        s = s.decode('ascii')  # or  s = str(s)[2:-1]
else:
    # for Python 2
    if isinstance(s, unicode):
        s = str(s)

Both of these are unpythonic, and most of the time there is probably a better way.
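
One possible alternative is the try/except wrapping mentioned at the start of this answer: define the type names once, then use isinstance as usual. The following is a sketch only; text_type, binary_type and to_native_str are names chosen here for illustration, not something the answer prescribes.

```python
try:
    text_type = unicode      # Python 2: the name exists
    binary_type = str
except NameError:
    text_type = str          # Python 3: unicode is not defined
    binary_type = bytes


def to_native_str(s, encoding="ascii"):
    """Convert text or bytes to the native str type of the running Python."""
    if isinstance(s, str):
        return s
    if isinstance(s, binary_type):      # only reached on Python 3 (bytes)
        return s.decode(encoding)
    if isinstance(s, text_type):        # only reached on Python 2 (unicode)
        return s.encode(encoding)
    raise TypeError("expected a string, got %r" % type(s))


print(to_native_str(b"abc"))   # abc
print(to_native_str(u"abc"))   # abc
```

The NameError is handled once at import time, so the rest of the code does not need version checks.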

Answer 6

Use:

import six

if isinstance(obj, six.text_type):
    pass  # obj is text: unicode on Python 2, str on Python 3

In the six library, the relevant names are defined as:

if PY3:
    string_types = str,
    text_type = str
else:
    string_types = basestring,
    text_type = unicode

Answer 7

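Note that on Python 3, it's not really fair to say any of:

str is UTFx for any x (e.g. UTF8)
str is Unicode
str is an ordered collection of Unicode characters

Python's str type is (normally) a sequence of Unicode code points, some of which map to characters.

Even on Python 3, this question is not as simple to answer as you might imagine.

An obvious way to test for ASCII-compatible strings is to attempt an encode: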

"Hello there!".encode("ascii")
#>>> b'Hello there!'

"Hello there... ☃!".encode("ascii")
#>>> Traceback (most recent call last):
#>>>   File "", line 4, in <module>
#>>> UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 15: ordinal not in range(128)

The error distinguishes the two cases.

In Python 3, there are even some strings that contain invalid Unicode code points:

"Hello there!".encode("utf8")
#>>> b'Hello there!'

"\udcc3".encode("utf8")
#>>> Traceback (most recent call last):
#>>>   File "", line 19, in <module>
#>>> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 0: surrogates not allowed

The same method is used to tell them apart.
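
The two encode attempts above can be combined into one function. This is a Python 3 sketch; the name classify and the three labels it returns are invented for illustration.

```python
def classify(text):
    """Classify a str by which codecs are able to encode it."""
    try:
        text.encode("ascii")
        return "ascii"
    except UnicodeEncodeError:
        pass
    try:
        text.encode("utf-8")
        return "unicode"
    except UnicodeEncodeError:
        # lone surrogates such as '\udcc3' cannot be encoded as UTF-8
        return "not encodable"


print(classify("Hello there!"))              # ascii
print(classify("Hello there... \u2603!"))    # unicode
print(classify("\udcc3"))                    # not encodable
```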

Answer 8

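This may help someone else. I started out testing for the string type of the variable s, but for my application it made more sense to simply return s as utf-8. The process calling return_utf then knows what it is dealing with and can handle the string appropriately. The code is not pristine, but I intend it to be Python version agnostic, without a version test or importing six. Please comment with improvements to the sample code below to help other people.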

def return_utf(s):
    if isinstance(s, str):
        return s.encode('utf-8')
    if isinstance(s, (int, float, complex)):
        return str(s).encode('utf-8')
    try:
        return s.encode('utf-8')
    except TypeError:
        try:
            return str(s).encode('utf-8')
        except AttributeError:
            return s
    except AttributeError:
        return s
    return s # assume it was already utf-8

Answer 9

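You could use Universal Encoding Detector, but be aware that it will only give you a best guess, not the actual encoding, because it is impossible to know the encoding of a string such as "abc". You will need to get the encoding information from somewhere else; for example, the HTTP protocol uses the Content-Type header for that.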
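
As an illustration of taking the encoding from metadata rather than guessing, the sketch below reads the charset parameter from a Content-Type value using the standard library's email.message.Message. The helper name charset_from_content_type and the sample header and bytes are made up for this example.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the charset and returns None if absent
    return msg.get_content_charset()


raw = b"caf\xe9"                                   # bytes received from somewhere
charset = charset_from_content_type("text/html; charset=ISO-8859-1")
print(charset)                                     # iso-8859-1
print(raw.decode(charset))                         # café
print(charset_from_content_type("text/html"))      # None
```

When the header carries no charset, the function returns None and the caller has to fall back to a default or a guess.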

Answer 10

For py2/py3 compatibility simply use

import six
if isinstance(obj, six.text_type):
    pass  # obj is a text string

Answer 11

One simple approach is to check whether unicode is a builtin function. If it is, you are on Python 2 and your string will be a str. To make sure everything is unicode, you can do:

try:
    import builtins                 # Python 3
except ImportError:
    import __builtin__ as builtins  # Python 2

i = 'cats'
if 'unicode' in dir(builtins):     # True in python 2, False in 3
  i = unicode(i)

Answer 12

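In Python 3, I had to work out whether a bytes value looked like b'\x7f\x00\x00\x01' or b'127.0.0.1'. My solution is like this:

def get_str(value):
    # Assumes value is a bytes object. latin-1 maps every byte to one
    # character, so decoding never fails and raw control bytes stay unprintable.
    str_value = value.decode('latin-1')

    if str_value.isprintable():
        return str_value

    # Iterating over bytes yields ints, so each octet formats with %d.
    return '.'.join(['%d' % x for x in value])

It worked for me; I hope it works for whoever needs it.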

12 answers

Answer 1: 313, accepted, 2011-02-13 22:40:50
Answer 2: 130, 2011-02-13 22:33:39
Answer 3: 48, 2013-09-09 20:24:54
Answer 4: 33, 2012-05-21 14:12:19
Answer 5: 24, 2012-08-14 12:33:05
Answer 6: 12, 2016-08-08 08:50:49
Answer 7: 4, 2014-07-09 02:35:59
Answer 8: 3, 2015-12-23 22:16:43
Answer 9: 2, 2011-02-13 22:34:55
Answer 10: 0, 2018-05-28 11:56:41
Answer 11: 0, 2019-09-18 14:24:38
Answer 12: 0, 2021-04-07 16:05:45
