How to convert a string to utf-8 in Python

I have a browser which sends utf-8 characters to my Python server, but when I retrieve them from the query string, the encoding that Python returns is ASCII. How can I convert the plain string to utf-8?

NOTE: The string passed from the web is already UTF-8 encoded; I just want to make Python treat it as UTF-8, not ASCII.

In Python 2

>>> plain_string = "Hi!"
>>> unicode_string = u"Hi!"
>>> type(plain_string), type(unicode_string)
(<type 'str'>, <type 'unicode'>)

^ This is the difference between a byte string (plain_string) and a unicode string (unicode_string).

>>> s = "Hello!"
>>> u = unicode(s, "utf-8")

^ Converting to unicode and specifying the encoding.

In Python 3

All strings are unicode. The unicode function does not exist anymore. See the answer from @Noumenon.
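For Python 3, a minimal sketch of the same conversion, assuming the data reaches you as a bytes object (the variable names are illustrative and not from the question):

```python
# Bytes as they might arrive from the network, already UTF-8 encoded.
raw = b"Hi \xc3\xb1"

# bytes.decode() is the Python 3 counterpart of unicode(s, "utf-8").
text = raw.decode("utf-8")

print(type(raw).__name__, type(text).__name__)
print(text)
```

If the value is already a str, there is nothing to decode; str has no decode method in Python 3.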

If the above doesn't work, you can also tell Python to ignore the parts of the string that it can't convert to utf-8:

stringnamehere.decode('utf-8', 'ignore')
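To see what 'ignore' actually does, here is a small sketch in Python 3, where decode is called on bytes. The sample bytes are made up for illustration, and 'replace' is shown as an alternative handler that keeps a visible marker instead of silently dropping data:

```python
damaged = b"caf\xc3\xa9 \xff ok"  # \xff can never appear in valid UTF-8

try:
    damaged.decode("utf-8")       # default is errors="strict"
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = damaged.decode("utf-8", "ignore")    # silently drops the bad byte
replaced = damaged.decode("utf-8", "replace")  # marks it with U+FFFD instead

print(strict_failed)
print(repr(ignored))
print(repr(replaced))
```

Both handlers lose information, so they are probably best kept for cases where the input is known to be partly corrupt.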

Might be a bit overkill, but when I work with ascii and unicode in the same files, repeating decode can be a pain, so this is what I use:

def make_unicode(inp):
    if type(inp) != unicode:
        inp =  inp.decode('utf-8')
    return inp
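The helper above is Python 2 only, since unicode no longer exists in Python 3. A rough Python 3 counterpart might look like this (make_text and its encoding parameter are hypothetical names, not from the answer):

```python
def make_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes-like."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)
    return value


print(make_text(b"\xc3\xb1"))  # bytes are decoded
print(make_text("\u00f1"))     # str passes through unchanged
```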

Adding the following line to the top of your .py file:

# -*- coding: utf-8 -*-

allows you to write non-ASCII string literals directly in your script, like this:

utfstr = "ボールト"

If I understand you correctly, you have a utf-8 encoded byte-string in your code.

Converting a byte-string to a unicode string is known as decoding (unicode -> byte-string is encoding).

You do that by using the unicode function or the decode method. Either:

unicodestr = unicode(bytestr, encoding)
unicodestr = unicode(bytestr, "utf-8")

Or:

unicodestr = bytestr.decode(encoding)
unicodestr = bytestr.decode("utf-8")

city = 'Ribeir\xc3\xa3o Preto'
print city.decode('cp1252').encode('utf-8')
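Which codec to pass to decode depends on how the bytes were really encoded. As an illustrative Python 3 sketch using the same bytes, decoding UTF-8 data as cp1252 gives mojibake, and that mistake can be undone if no bytes were lost along the way:

```python
raw = b"Ribeir\xc3\xa3o Preto"  # these bytes are valid UTF-8

as_utf8 = raw.decode("utf-8")     # correct interpretation
as_cp1252 = raw.decode("cp1252")  # each byte becomes its own character

# Undo the wrong decode: back to the original bytes, then decode properly.
repaired = as_cp1252.encode("cp1252").decode("utf-8")

print(as_utf8)
print(as_cp1252)
print(repaired)
```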

Python 3.6 does not have a built-in unicode() function. Strings are already stored as unicode by default and no conversion is required. Example:

my_str = "\u221a25"
print(my_str)
>>> √25

Translate with ord() and unichr(). Every unicode char has a number associated with it, something like an index. So Python has a few functions to translate between a char and its number. Below is an ñ example. Hope it can help.

>>> C = 'ñ'
>>> U = C.decode('utf8')
>>> U
u'\xf1'
>>> ord(U)
241
>>> unichr(241)
u'\xf1'
>>> print unichr(241).encode('utf8')
ñ
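The session above is Python 2. In Python 3, unichr is gone and chr covers the full Unicode range, so a comparable sketch would be:

```python
char = "\u00f1"            # the letter n with tilde, already a str in Python 3

code_point = ord(char)     # character -> code point number
back = chr(code_point)     # code point number -> character
utf8_bytes = char.encode("utf-8")

print(code_point)
print(back == char)
print(utf8_bytes)
```

The code point (241) is a property of the character; the two bytes produced by encode are specific to UTF-8.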

  • First, str in Python is represented in Unicode.
  • Second, UTF-8 is an encoding standard to encode Unicode string to bytes. There are many encoding standards out there (eg UTF-16, ASCII, SHIFT-JIS, etc.).

When the client sends data to your server and they are using UTF-8, they are sending a bunch of bytes, not str.

You received a str because the "library" or "framework" that you are using has implicitly converted some random bytes to str.

Under the hood, there is just a bunch of bytes. You just need to ask the "library" to give you the request content in bytes and handle the decoding yourself (if the library can't give you that, it is trying to do black magic and you shouldn't use it).

  • Decode UTF-8 encoded bytes to str: bs.decode('utf-8')
  • Encode str to UTF-8 bytes: s.encode('utf-8')
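Put together, handling the raw request content yourself could look roughly like this. decode_body is a hypothetical helper, and the body bytes are built by hand here rather than taken from any real framework:

```python
def decode_body(raw: bytes, charset: str = "utf-8") -> str:
    """Decode raw request bytes using the charset the client declared."""
    return raw.decode(charset)


# Stand-in for the bytes a client would send over the wire.
body = "T\u00e9st\u00e3o".encode("utf-8")
text = decode_body(body)

print(len(body), len(text))          # byte count vs. character count
print(text.encode("utf-8") == body)  # encoding is the exact inverse
```

The two lengths differ because each accented letter takes two bytes in UTF-8 but counts as a single character in the str.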

You can also do this:

from unidecode import unidecode
unidecode(yourStringtoDecode)

You can use Python's standard library codecs module.

import codecs
codecs.decode(b'Decode me', 'utf-8')

The url is translated to ASCII and to the Python server it is just a Unicode string, eg.: "T%C3%A9st%C3%A3o"

Python understands "é" and "ã" as actual %C3%A9 and %C3%A3.

You can decode a URL just like this:

import urllib.parse
url = "T%C3%A9st%C3%A3o"
print(urllib.parse.unquote(url))
>> Téstão
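A slightly fuller sketch with the standard library's urllib.parse, showing the reverse direction and a whole query string. The query string used here is made up for illustration:

```python
from urllib.parse import parse_qs, quote, unquote

decoded = unquote("T%C3%A9st%C3%A3o")  # percent-escapes -> str (UTF-8 by default)
encoded = quote(decoded)               # str -> percent-escapes again

# parse_qs splits a query string and percent-decodes every value.
params = parse_qs("name=T%C3%A9st%C3%A3o&lang=pt")

print(decoded)
print(encoded)
print(params)
```

Both unquote and parse_qs accept an encoding argument if the client used something other than UTF-8.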

See https://www.adamsmith.haus/python/answers/how-to-decode-a-utf-8-url-in-python for details.

Yes, you can add

# -*- coding: utf-8 -*-

in your source code's first line.

You can read more details here: https://www.python.org/dev/peps/pep-0263/


 