How to convert a string to utf-8 in Python
I have a browser which sends utf-8 characters to my Python server, but when I retrieve it from the query string, the encoding that Python returns is ASCII. How can I convert the plain string to utf-8?
NOTE: The string passed from the web is already UTF-8 encoded; I just want to make Python treat it as UTF-8, not ASCII.
>>> plain_string = "Hi!"
>>> unicode_string = u"Hi!"
>>> type(plain_string), type(unicode_string)
(<type 'str'>, <type 'unicode'>)
^ This is the difference between a byte string (plain_string) and a unicode string.
>>> s = "Hello!"
>>> u = unicode(s, "utf-8")
^ Converting to unicode and specifying the encoding.
In Python 3, all strings are unicode and the unicode function does not exist anymore. See the answer from @Noumenon.
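For Python 3 readers, here is a minimal sketch of the same conversion, assuming the data arrives as a bytes object. The variable names are only illustrative.

```python
# Python 3: bytes plays the role of the Python 2 byte string,
# and str plays the role of the Python 2 unicode type.
raw = b"Hello!"                 # bytes, e.g. as read from a socket
text = raw.decode("utf-8")      # bytes -> str (replaces unicode(s, "utf-8"))
same_text = str(raw, "utf-8")   # equivalent constructor form

print(type(raw).__name__, type(text).__name__)
print(text == same_text)
```

Both forms give the same str; decode is the more common spelling.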
If the above does not work, you can also tell Python to ignore the parts of the string it cannot convert to utf-8:
stringnamehere.decode('utf-8', 'ignore')
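To see what the error handler actually does, here is a small Python 3 sketch comparing 'ignore' with 'replace'. The sample bytes are made up for illustration: valid UTF-8 with one stray 0xFF byte in the middle.

```python
raw = b"caf\xc3\xa9 \xff ok"   # valid UTF-8 for "café", then an invalid 0xFF byte

ignored = raw.decode("utf-8", "ignore")    # silently drops the undecodable byte
replaced = raw.decode("utf-8", "replace")  # substitutes U+FFFD for it

print(repr(ignored))
print(repr(replaced))
```

Note that 'ignore' loses data without any trace, so 'replace' may be easier to debug if you need to know that something went wrong.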
Might be a bit overkill, but when I work with ascii and unicode in the same files, repeating decode can be a pain, so this is what I use:
def make_unicode(inp):
    # Python 2: decode byte strings, pass unicode objects through unchanged
    if type(inp) != unicode:
        inp = inp.decode('utf-8')
    return inp
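A possible Python 3 counterpart of this helper, as a sketch only. The name make_text and the encoding parameter are my own additions, not part of the answer above.

```python
def make_text(inp, encoding="utf-8"):
    """Return inp as str, decoding it first if it is bytes (hypothetical helper)."""
    if isinstance(inp, (bytes, bytearray)):
        return inp.decode(encoding)
    return inp  # already a str, pass it through unchanged


print(make_text(b"Hi \xc3\xb1"))   # bytes input gets decoded
print(make_text("Hi ñ"))           # str input is returned as-is
```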
Adding the following line to the top of your .py file:
# -*- coding: utf-8 -*-
allows you to encode strings directly in your script, like this:
utfstr = "ボールト"
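In Python 3 the source encoding defaults to UTF-8, so the literal is a str even without the declaration. A short sketch of how it relates to its UTF-8 bytes (the name encoded is illustrative):

```python
utfstr = "ボールト"                 # str: 4 code points
encoded = utfstr.encode("utf-8")   # bytes: each of these katakana takes 3 bytes

print(len(utfstr), len(encoded))
print(encoded.decode("utf-8") == utfstr)
```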
If I understand you correctly, you have a utf-8 encoded byte-string in your code.
Converting a byte-string to a unicode string is known as decoding (unicode -> byte-string is encoding).
You do that by using the unicode function or the decode method. Either:
unicodestr = unicode(bytestr, encoding)
unicodestr = unicode(bytestr, "utf-8")
Or:
unicodestr = bytestr.decode(encoding)
unicodestr = bytestr.decode("utf-8")
city = 'Ribeir\xc3\xa3o Preto'
print city.decode('cp1252').encode('utf-8')
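The choice of codec matters here: the bytes \xc3\xa3 are the UTF-8 encoding of ã, so decoding them as cp1252 gives two unrelated characters instead. A Python 3 sketch of the difference, with illustrative variable names:

```python
city = b"Ribeir\xc3\xa3o Preto"

as_utf8 = city.decode("utf-8")      # the intended text
as_cp1252 = city.decode("cp1252")   # wrong codec: 0xC3 -> 'Ã', 0xA3 -> '£'

# Text garbled this way can be reversed by re-encoding with the wrong codec
# and decoding again with the right one.
repaired = as_cp1252.encode("cp1252").decode("utf-8")

print(as_utf8)
print(as_cp1252)
print(repaired == as_utf8)
```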
Python 3.6 does not have a built-in unicode() function. Strings are already stored as unicode by default and no conversion is required. Example:
my_str = "\u221a25"
print(my_str)
# prints: √25
Translate with ord() and unichr(). Every unicode character has a number associated with it, something like an index, so Python has a few functions to translate between a character and its number. Below is an example with ñ. Hope it helps.
>>> C = 'ñ'
>>> U = C.decode('utf8')
>>> U
u'\xf1'
>>> ord(U)
241
>>> unichr(241)
u'\xf1'
>>> print unichr(241).encode('utf8')
ñ
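In Python 3 the same round trip looks a little different, because unichr is gone and chr takes its place. A minimal sketch:

```python
c = "ñ"
code_point = ord(c)                 # integer code point of the character
back = chr(code_point)              # chr replaces Python 2's unichr
utf8_bytes = back.encode("utf-8")   # the two bytes UTF-8 uses for ñ

print(code_point)
print(back)
print(utf8_bytes)
```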
str in Python is represented in Unicode. UTF-8 is an encoding standard to encode a Unicode string to bytes. There are many encoding standards out there (e.g. UTF-16, ASCII, SHIFT-JIS, etc.). When the client sends data to your server and they are using UTF-8, they are sending a bunch of bytes, not str.
You received a str because the "library" or "framework" that you are using has implicitly converted some random bytes to str.
Under the hood, there is just a bunch of bytes. You just need to ask the "library" to give you the request content in bytes and handle the decoding yourself (if the library can't give you that, it is trying to do black magic and you shouldn't use it).
UTF-8 encoded bytes to str: bs.decode('utf-8')
str to UTF-8 bytes: s.encode('utf-8')
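A short Python 3 sketch of that round trip, using a made-up sample string. It also shows why treating the same bytes as ASCII fails.

```python
s = "Téstão"
bs = s.encode("utf-8")         # str -> UTF-8 bytes

print(bs)
print(len(s), len(bs))         # 6 characters, 8 bytes (é and ã take 2 bytes each)

round_trip = bs.decode("utf-8")   # UTF-8 bytes -> str
print(round_trip == s)

# The same bytes are not valid ASCII, so a strict ASCII decode raises.
try:
    bs.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)
```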
You can also do this (unidecode is a third-party package that transliterates Unicode text to ASCII):
from unidecode import unidecode
unidecode(yourStringtoDecode)
You can use Python's standard library codecs module.
import codecs
codecs.decode(b'Decode me', 'utf-8')
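The codecs module is also handy when the bytes arrive in pieces, since a multi-byte character can be split across two chunks. A sketch using an incremental decoder; the sample data and the split point are made up for illustration.

```python
import codecs

data = "Téstão".encode("utf-8")
chunks = [data[:2], data[2:]]   # the split falls inside the 2-byte sequence for é

decoder = codecs.getincrementaldecoder("utf-8")()
parts = [decoder.decode(chunk) for chunk in chunks]
parts.append(decoder.decode(b"", final=True))   # flush; raises if bytes are left over

text = "".join(parts)
print(parts)
print(text)
```

Calling chunk.decode('utf-8') on each piece separately would raise here, because the first chunk ends in the middle of a character.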
The URL is translated to ASCII, and to the Python server it is just a Unicode string, e.g.: "T%C3%A9st%C3%A3o"
Python sees "é" and "ã" as the literal escapes %C3%A9 and %C3%A3.
You can decode a URL just like this:
import urllib.parse
url = "T%C3%A9st%C3%A3o"
print(urllib.parse.unquote(url))
# prints: Téstão
See https://www.adamsmith.haus/python/answers/how-to-decode-a-utf-8-url-in-python for details.
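If you would rather see the intermediate bytes, urllib.parse also offers unquote_to_bytes, which undoes the percent-escaping and leaves the decoding step to you. A sketch with the same sample URL fragment:

```python
from urllib.parse import quote, unquote, unquote_to_bytes

url = "T%C3%A9st%C3%A3o"

raw = unquote_to_bytes(url)     # percent-escapes -> raw bytes
text = raw.decode("utf-8")      # raw bytes -> str, decoded explicitly

print(raw)
print(text)
print(text == unquote(url, encoding="utf-8"))   # unquote does both steps at once
print(quote(text))                              # back to the percent-encoded form
```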
Yes, you can add
# -*- coding: utf-8 -*-
in your source code's first line.
You can read more details here: https://www.python.org/dev/peps/pep-0263/