What is the difference between a string and a byte string?

I am working with a library which returns a "byte string" (bytes) and I need to convert this to a string.

Is there actually a difference between those two things? How are they related, and how can I do the conversion?

The only thing that a computer can store is bytes.

To store anything in a computer, you must first encode it, i.e. convert it to bytes. For example:

  • If you want to store music, you must first encode it using MP3, WAV, etc.
  • If you want to store a picture, you must first encode it using PNG, JPEG, etc.
  • If you want to store text, you must first encode it using ASCII, UTF-8, etc.

MP3, WAV, PNG, JPEG, ASCII and UTF-8 are examples of encodings. An encoding is a format to represent audio, images, text, etc. in bytes.

In Python, a byte string is just that: a sequence of bytes. It isn't human-readable. Under the hood, everything must be converted to a byte string before it can be stored in a computer.

On the other hand, a character string, often just called a "string", is a sequence of characters. It is human-readable. A character string can't be directly stored in a computer; it has to be encoded first (converted into a byte string). There are multiple encodings through which a character string can be converted into a byte string, such as ASCII and UTF-8.

'I am a string'.encode('ASCII')

The above Python code will encode the string 'I am a string' using the encoding ASCII. The result of the above code will be a byte string. If you print it, Python will represent it as b'I am a string'. Remember, however, that byte strings aren't human-readable; it's just that Python shows every byte that corresponds to a printable ASCII character as that character when you print them (other bytes appear as \xNN escapes). In Python, a byte string is represented by a b, followed by the byte string's ASCII representation.
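As a small illustration of how Python displays a byte string, the sketch below encodes one pure-ASCII word and one word with an accented letter. The sample words are arbitrary choices, not part of the original answer.

```python
# Two sample strings: one pure ASCII, one containing U+00E9 (e with acute accent).
ascii_bytes = 'cafe'.encode('utf-8')
accented_bytes = 'caf\u00e9'.encode('utf-8')

# Printable ASCII bytes are shown as characters, everything else as \xNN escapes.
print(ascii_bytes)           # b'cafe'
print(accented_bytes)        # b'caf\xc3\xa9'

# Underneath, a bytes object is just a sequence of integers in range(256).
print(list(accented_bytes))  # [99, 97, 102, 195, 169]
```

The b'...' form is only a display convention; the object itself holds the integers shown on the last line.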

A byte string can be decoded back into a character string, if you know the encoding that was used to encode it.

b'I am a string'.decode('ASCII')

The above code will return the original string 'I am a string'.

Encoding and decoding are inverse operations. Everything must be encoded before it can be written to disk, and it must be decoded before it can be read by a human.
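The round trip described above can be sketched in a few lines. The variable names here are only illustrative.

```python
text = 'I am a string'

# Encode: str -> bytes
data = text.encode('ascii')

# Decode: bytes -> str, using the same encoding that produced the bytes
restored = data.decode('ascii')

print(type(data).__name__, data)          # bytes b'I am a string'
print(type(restored).__name__, restored)  # str I am a string
print(restored == text)                   # True
```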

Assuming Python 3 (in Python 2, this difference is a little less well-defined) - a string is a sequence of characters, i.e. Unicode code points; these are an abstract concept, and can't be directly stored on disk. A byte string is a sequence of, unsurprisingly, bytes - things that can be stored on disk. The mapping between them is an encoding - there are quite a lot of these (and infinitely many are possible) - and you need to know which applies in the particular case in order to do the conversion, since a different encoding may map the same bytes to a different string:

>>> b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'.decode('utf-16')
'蓏콯캁澽苏'
>>> b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'.decode('utf-8')
'τoρνoς'

Once you know which one to use, you can use the .decode() method of the byte string to get the right character string from it as above. For completeness, the .encode() method of a character string goes the opposite way:

>>> 'τoρνoς'.encode('utf-8')
b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'
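To make the "you need to know which encoding applies" point concrete, here is a minimal sketch that encodes the same string with two encodings and then tries to decode with the wrong one. The choice of 'utf-16-le' and 'ascii' as the comparison encodings is an assumption made for illustration.

```python
text = '\u03c4o\u03c1\u03bdo\u03c2'  # the same six-character string as above

utf8 = text.encode('utf-8')
utf16 = text.encode('utf-16-le')

# Same characters, different bytes and different lengths.
print(len(text), len(utf8), len(utf16))  # 6 10 12
print(utf8 == utf16)                     # False

# Decoding with an encoding that cannot represent these bytes raises an error.
try:
    utf8.decode('ascii')
except UnicodeDecodeError as exc:
    print('cannot decode as ASCII:', exc.reason)
```

Not every wrong guess raises an error; as the utf-16 example above shows, some simply produce different (wrong) text.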

Note: I will elaborate more on Python 3 in my answer, since the end of life of Python 2 is very close.

In Python 3

bytes consists of sequences of 8-bit unsigned values, while str consists of sequences of Unicode code points that represent textual characters from human languages.

>>> # bytes
>>> b = b'h\x65llo'
>>> type(b)
<class 'bytes'>
>>> list(b)
[104, 101, 108, 108, 111]
>>> print(b)
b'hello'
>>>
>>> # str
>>> s = 'nai\u0308ve'
>>> type(s)
<class 'str'>
>>> list(s)
['n', 'a', 'i', '̈', 'v', 'e']
>>> print(s)
naïve

Even though bytes and str seem to work the same way, their instances are not compatible with each other, i.e. bytes and str instances can't be used together with operators like > and +. In addition, keep in mind that comparing bytes and str instances for equality, i.e. using ==, will always evaluate to False even when they contain exactly the same characters.

>>> # concatenation
>>> b'hi' + b'bye' # this is possible
b'hibye'
>>> 'hi' + 'bye' # this is also possible
'hibye'
>>> b'hi' + 'bye' # this will fail
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't concat str to bytes
>>> 'hi' + b'bye' # this will also fail
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate str (not "bytes") to str
>>>
>>> # comparison
>>> b'red' > b'blue' # this is possible
True
>>> 'red' > 'blue' # this is also possible
True
>>> b'red' > 'blue' # you can't compare bytes with str
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '>' not supported between instances of 'bytes' and 'str'
>>> 'red' > b'blue' # you can't compare str with bytes
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '>' not supported between instances of 'str' and 'bytes'
>>> b'blue' == 'red' # equality between str and bytes always evaluates to False
False
>>> b'blue' == 'blue' # equality between str and bytes always evaluates to False
False
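If the goal is to compare a bytes value with a str value, one common approach is to convert one side first. A short sketch, assuming the bytes are ASCII-encoded text:

```python
raw = b'blue'
text = 'blue'

# Mixed comparison: always False, even though the "contents" look identical.
print(raw == text)                  # False

# Convert one side so both operands have the same type.
print(raw.decode('ascii') == text)  # True
print(raw == text.encode('ascii'))  # True
```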

Another issue when dealing with bytes and str arises when working with files that are returned by the open built-in function. On one hand, if you want to read or write binary data to/from a file, always open the file using a binary mode like 'rb' or 'wb'. On the other hand, if you want to read or write Unicode data to/from a file, be aware of the default encoding of your computer, and if necessary pass the encoding parameter to avoid surprises.
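The two ways of opening a file can be sketched as follows. The file name and sample text are hypothetical, and a temporary directory is used so the example leaves nothing behind.

```python
import os
import tempfile

text = 'na\u00efve caf\u00e9'  # sample text containing non-ASCII characters

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.txt')  # hypothetical file name

    # Text mode: str in, and the encoding is passed explicitly.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

    # Binary mode: the raw bytes come back, with no decoding.
    with open(path, 'rb') as f:
        raw = f.read()

    # Text mode again: the bytes are decoded back into str.
    with open(path, 'r', encoding='utf-8') as f:
        decoded = f.read()

print(type(raw).__name__, raw)          # bytes b'na\xc3\xafve caf\xc3\xa9'
print(type(decoded).__name__, decoded)  # str, the original text
```

Passing encoding explicitly avoids depending on the platform default, which is the surprise the paragraph above warns about.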

In Python 2

str consists of sequences of 8-bit values, while unicode consists of sequences of Unicode characters. One thing to keep in mind is that str and unicode can be used together with operators if str only consists of 7-bit ASCII characters.

It might be useful to use helper functions to convert between str and unicode in Python 2, and between bytes and str in Python 3.
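A minimal sketch of such helpers for Python 3 is shown below. The names to_str and to_bytes and the UTF-8 default are assumptions for illustration, not functions from any library.

```python
def to_str(value, encoding='utf-8'):
    """Return value as str, decoding it first if it is bytes (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding='utf-8'):
    """Return value as bytes, encoding it first if it is str (hypothetical helper)."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value


print(to_str(b'hello'))    # hello
print(to_str('hello'))     # hello
print(to_bytes('hello'))   # b'hello'
print(to_bytes(b'hello'))  # b'hello'
```

Calling such helpers at the boundaries of a program (reading input, writing output) lets the rest of the code work with a single type.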

From What is Unicode:

Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one.

......

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

So when a computer represents a string, it looks up each character of the string by its unique Unicode number, and these numbers are stored in memory. But you can't directly write the string to disk or transmit it over a network as Unicode numbers, because these numbers are just plain integers, not bytes. You should encode the string to a byte string, for example with UTF-8. UTF-8 is a character encoding capable of encoding all possible characters, and it stores characters as bytes (it looks like this). So the encoded string can be used everywhere, because UTF-8 is supported nearly everywhere. When you open a text file encoded in UTF-8 from another system, your computer will decode it and display the characters in it through their unique Unicode numbers. When a browser receives string data encoded in UTF-8 from the network, it will decode the data to a string (assuming the browser uses UTF-8) and display the string.
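The "unique Unicode number" of each character can be inspected with the built-in ord, and chr goes the other way. A brief sketch, using a two-character sample string:

```python
text = '\u4e2d\u6587'  # a two-character sample string

# Each character has one Unicode number (code point).
code_points = [ord(ch) for ch in text]
print([hex(cp) for cp in code_points])  # ['0x4e2d', '0x6587']

# chr maps a code point back to its character.
print(''.join(chr(cp) for cp in code_points) == text)  # True

# Encoding turns those abstract numbers into concrete bytes.
encoded = text.encode('utf-8')
print(len(text), len(encoded))  # 2 6
```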

In Python 3, you can convert between strings and byte strings:

>>> print('中文'.encode('utf-8'))
b'\xe4\xb8\xad\xe6\x96\x87'
>>> print(b'\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8'))
中文 

In short, a string is for displaying to humans to read on a computer, and a byte string is for storing to disk and for data transmission.

Let's take a simple one-character string 'š' and encode it into a sequence of bytes:

>>> 'š'.encode('utf-8')
b'\xc5\xa1'

For the purpose of this example, let's display the sequence of bytes in its binary form:

>>> bin(int(b'\xc5\xa1'.hex(), 16))
'0b1100010110100001'

Now it is generally not possible to decode the information back without knowing how it was encoded. Only if you know that the UTF-8 text encoding was used can you follow the algorithm for decoding UTF-8 and acquire the original string:

11000101 10100001
   ^^^^^   ^^^^^^
   00101   100001

You can convert the binary number 101100001 back to a character:

>>> chr(int('101100001', 2))
'š'
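The bit extraction shown above can also be written out with masks and shifts. This is a sketch for a two-byte UTF-8 sequence only, not a general decoder.

```python
data = b'\xc5\xa1'
first, second = data  # iterating over bytes yields integers

# A two-byte UTF-8 sequence has the shape 110xxxxx 10xxxxxx.
assert first >> 5 == 0b110   # lead byte marker
assert second >> 6 == 0b10   # continuation byte marker

# Keep the 5 payload bits of the lead byte and the 6 of the continuation byte.
code_point = ((first & 0b00011111) << 6) | (second & 0b00111111)

print(bin(code_point))                          # 0b101100001
print(chr(code_point) == data.decode('utf-8'))  # True
```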

Unicode is an agreed-upon standard that assigns numbers to characters, to various kinds of formatting (e.g. lower case/upper case, new line, carriage return), and to other "things" (e.g. emojis); encodings such as UTF-8 then define how those numbers are laid out as bits. A computer is no less capable of storing a Unicode representation (a series of bits), whether in memory or in a file, than it is of storing an ASCII representation, or any other representation (series of bits).

For communication to take place, the parties to the communication must agree on what representation will be used.

Because Unicode seeks to represent all the possible characters (and other "things") used in inter-human and inter-computer communication, it requires a greater number of bits for the representation of many characters (or things) than other systems of representation that seek to represent a more limited set of characters/things. To "simplify," and perhaps to accommodate historical usage, Unicode text is almost always converted to a specific byte encoding (e.g. UTF-8, or ASCII when only ASCII characters are present) for the purpose of storing characters in files.

It is not the case that Unicode cannot be used for storing characters in files, or transmitting them through any communications channel, simply that it is not.

The term "string" is not precisely defined. "String," in its common usage, refers to a set of characters/things. In a computer, those characters may be stored in any one of many different bit-by-bit representations. A "byte string" is a sequence of eight-bit units (eight bits being referred to as a byte). Since, these days, computers use the Unicode system to represent characters in memory, and byte strings to store characters in files, a conversion (an encoding) must be applied before characters represented in memory are moved into storage in files.
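To see how many bytes a character needs once it is encoded, here is a short sketch using UTF-8. The four sample characters are arbitrary picks from different parts of the Unicode range.

```python
# One character each from ASCII, Latin-1 Supplement, CJK, and the emoji range.
samples = ['a', '\u00e9', '\u4e2d', '\U0001f600']

for ch in samples:
    encoded = ch.encode('utf-8')
    print(f'U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded!r}')

lengths = [len(ch.encode('utf-8')) for ch in samples]
print(lengths)  # [1, 2, 3, 4]
```

Each item in samples is a single character in a str, but its UTF-8 byte string has between one and four bytes.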

Put simply, think of our natural languages, like English, Bengali, Chinese, etc. When spoken, all of these languages make sounds. But do we understand all of them even if we hear them? The answer is generally no. So, if I say I understand English, it means that I know how those sounds are encoded into meaningful English words, and I just decode the sounds in the same way to understand them. The same goes for any other language: if you know it, you have the encoder-decoder pack for that language in your mind; if you don't know it, you just don't have this.

The same goes for digital systems. Just as we can only hear sounds with our ears and make sounds with our mouths, computers can only store bytes and read bytes. So, a given application knows how to read bytes and interpret them (for example, how many bytes to consider to understand a piece of information) and also writes in the same way, so that its fellow applications also understand it. But without that understanding (the encoder-decoder), all data written to a disk is just strings of bytes.

A string is a bunch of items strung together. A byte string is a sequence of bytes, like b'\xce\xb1\xce\xac', which represents "αά". A character string is a bunch of characters, like "αά". Synonymous to a sequence.

A byte string can be stored directly on disk, while a string (character string) cannot. The mapping between them is an encoding.

The Python language includes str and bytes as standard "Built-in Types". In other words, they are both classes. I don't think it's worthwhile trying to rationalize why Python has been implemented this way.

Having said that, str and bytes are very similar to one another. Both share most of the same methods. The following methods are unique to the str class:

casefold
encode
format
format_map
isdecimal
isidentifier
isnumeric
isprintable

The following methods are unique to the bytes class:

decode
fromhex
hex
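The two lists above can be reproduced by comparing the public attribute names of both classes. This is a rough sketch; the helper name public_names is made up here, and the exact result may vary between Python versions.

```python
def public_names(cls):
    """Return the attribute names of cls that do not start with an underscore."""
    return {name for name in dir(cls) if not name.startswith('_')}


# Set difference gives the names present on one class but not the other.
str_only = sorted(public_names(str) - public_names(bytes))
bytes_only = sorted(public_names(bytes) - public_names(str))

print('only on str:  ', str_only)
print('only on bytes:', bytes_only)
```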
