
What is a Python bytestring?

What's a Python bytestring?

All I can find are topics on how to encode to bytestring or decode to ascii or utf-8. I'm trying to understand how it works under the hood. In a normal ASCII string, it's an array or list of characters, and each character represents an ASCII value from 0-255, so that's how you know what character is represented by the number. In Unicode, it's the 8- or 16-byte representation for the character that tells you what character it is.

So what is a bytestring? How does Python know which characters to represent as what? How does it work under the hood? Since you can print or even return these strings and it shows you the string representation, I don't quite get it...

Ok, so my point is definitely getting missed here. I've been told that it's an immutable sequence of bytes without any particular interpretation.

A sequence of bytes.. Okay, let's say one byte:
'a'.encode() returns b'a'.

Simple enough. Why can I read the a?

Say I get the ASCII value for a, by doing this:
printf "%d" "'a"

It returns 97. Okay, good, the integer value for the ASCII character a. If we interpret 97 as ASCII, say in a C char, then we get the letter a. Fair enough. If we convert the byte representation to bits, we get this:

01100001

2^0 + 2^5 + 2^6 = 97. Cool.

So why is 'a'.encode() returning b'a' instead of 01100001??
If it's without a particular interpretation, shouldn't it be returning something like b'01100001'?
It seems like it's interpreting it like ASCII.

Someone mentioned that it's calling __repr__ on the bytestring, so it's displayed in human-readable form. However, even if I do something like:

with open('testbytestring.txt', 'wb') as f:
    f.write(b'helloworld')

It will still insert helloworld as a regular string into the file, not as a sequence of bytes... So is a bytestring in ASCII?

It is a common misconception that text is ASCII or UTF-8 or cp1252, and therefore bytes are text.

Text is only text, in the way that images are only images. The matter of storing text or images to disk is a matter of encoding that data into a sequence of bytes. There are many ways to encode images into bytes (JPEG, PNG, SVG), and likewise many ways to encode text (ASCII, UTF-8 or cp1252).

Once encoding has happened, bytes are just bytes. Bytes are not images anymore; they have forgotten the colors they mean, although an image format decoder can recover that information. Bytes have similarly forgotten the letters they used to be. In fact, bytes don't remember whether they were images or text at all. Only out-of-band knowledge (filename, media headers, etc.) lets you guess what those bytes should mean, and even that can be wrong (in case of data corruption).
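As a small illustration of bytes having "forgotten" what they were, here is a sketch (variable names are made up for the example) in which the same two bytes are read three different ways. Nothing in the bytes themselves says which reading is the intended one.

```python
# The same two bytes, with no memory of what they were meant to be.
raw = b'\xc3\xa9'

# Three readers, three interpretations of identical bytes.
as_utf8 = raw.decode('utf-8')      # one character, U+00E9
as_latin1 = raw.decode('latin-1')  # two characters, U+00C3 and U+00A9
as_numbers = list(raw)             # no text at all, just integers

print(repr(as_utf8), repr(as_latin1), as_numbers)
```

Which of the three is "right" depends entirely on knowledge that lives outside the bytes.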

So, in Python (py3), we have two types for things that might otherwise look similar. For text, we have str, which knows it's text; it knows which letters it's supposed to mean. It doesn't know which bytes that might be, since letters are not bytes. We also have bytes (the bytestring type), which doesn't know if it's text or images or any other kind of data.

The two types are superficially similar, since they are both sequences of things, but the things that they are sequences of are quite different.
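To make "sequences of different things" concrete, a minimal Python 3 sketch (the sample word is arbitrary): iterating a str yields one-character strings, while iterating a bytes object yields integers in the range 0 to 255, and the two lengths need not agree.

```python
text = 'h\u00e9llo'              # five characters; the second is U+00E9
data = text.encode('utf-8')      # one possible byte encoding of that text

chars = list(text)               # elements of a str are one-character strs
octets = list(data)              # elements of a bytes object are ints

print(chars)
print(octets)
print(len(text), len(data))      # 5 characters, 6 bytes
```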

Implementationally, str is stored in memory as UCS-? where the ? is implementation defined; it may be UCS4, UCS2 or UCS1, depending on compile time options and which codepoints are present in the represented string.
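One rough way to observe this, assuming a recent CPython (3.3 or later, where the storage width is chosen per string): compare sys.getsizeof for strings of equal length whose widest code point differs. The exact numbers are an implementation detail and other interpreters may behave differently, so this is only a sketch.

```python
import sys

n = 1000
narrow = 'a' * n                 # every code point fits in 1 byte
medium = '\u1234' * n            # needs 2 bytes per code point
wide = '\U0001F600' * n          # needs 4 bytes per code point

# Same number of characters, different memory footprint (CPython detail).
sizes = [sys.getsizeof(s) for s in (narrow, medium, wide)]
print(sizes)
```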


Edit: "but why"?

Some things that look like text are actually defined in other terms. A really good example of this is the many internet protocols of the world. For instance, HTTP is a "text" protocol that is in fact defined using the ABNF syntax common in RFCs. These protocols are expressed in terms of octets, not characters, although an informal encoding may also be suggested:

2.3. Terminal Values

Rules resolve into a string of terminal values, sometimes called
characters. In ABNF, a character is merely a non-negative integer.
In certain contexts, a specific mapping (encoding) of values into a
character set (such as ASCII) will be specified.

This distinction is important, because it's not possible to send text over the internet; the only thing you can do is send bytes. Saying "text but in 'foo' encoding" makes the format that much more complex, since clients and servers now need to somehow figure out the encoding business on their own, hopefully in the same way, since they must ultimately pass data around as bytes anyway. This is doubly useless since these protocols are seldom about text handling anyway, and the text-like appearance is only a convenience for implementers. Neither the server owners nor end users are ever interested in reading the words Transfer-Encoding: chunked, so long as both the server and the browser understand it correctly.

By comparison, when working with text, you don't really care how it's encoded. You can express the "Heävy Mëtal Ümlaüts" any way you like, except "Heδvy Mλtal άmlaόts".


The distinct types thus give you a way to say "this value 'means' text" or "bytes".

Python does not know how to represent a bytestring. That's the point.

When you output a character with value 97 into pretty much any output window, you'll get the character 'a', but that's not part of the implementation; it's just a thing that happens to be locally true. If you want an encoding, you don't use bytestring. If you use bytestring, you don't have an encoding.
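The b'a' that the question keeps running into is a display convention of the bytes repr, not a property of the data. A short Python 3 sketch: the element really is the integer 97, and bytes that have no printable ASCII form are shown as \x escapes instead.

```python
one_byte = 'a'.encode()            # UTF-8 by default; a single byte

print(one_byte)                    # b'a' (repr shows printable ASCII as-is)
print(one_byte[0])                 # 97
print(format(one_byte[0], '08b'))  # 01100001

# Values without a printable ASCII form are rendered as \x escapes.
mixed = bytes([97, 0, 255])
print(repr(mixed))                 # b'a\x00\xff'
```

So b'a', bytes([97]) and the bit pattern 01100001 are three spellings of the same one-byte value.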

Your piece about .txt files shows you have misunderstood what is happening. You see, plain text files don't have an encoding either. They're just a series of bytes. These bytes get translated into letters by the text editor, but there is no guarantee at all that someone else opening your file will see the same thing as you if you stray outside the common set of ASCII characters.
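The file from the question can be used to sketch this. The scratch path below is made up for the example, and the choice of UTF-16 as the "other" encoding is arbitrary. The file holds ten bytes; whether they appear as helloworld depends on how the reader chooses to decode them.

```python
import os
import tempfile

# Hypothetical scratch location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'testbytestring.txt')

with open(path, 'wb') as f:
    f.write(b'helloworld')

with open(path, 'rb') as f:
    raw = f.read()                 # the ten bytes, uninterpreted

with open(path, encoding='ascii') as f:
    as_ascii = f.read()            # a reader that assumes ASCII

with open(path, encoding='utf-16-le') as f:
    as_utf16 = f.read()            # a reader that assumes UTF-16

print(list(raw))
print(as_ascii, len(as_utf16))
```

The ASCII reader sees helloworld; the UTF-16 reader sees five unrelated characters from the very same file.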

As the name implies, a Python 3 bytestring (or simply a str in Python 2.7) is a string of bytes. And, as others have pointed out, it is immutable.

It is distinct from a Python 3 str (or, more descriptively, a unicode in Python 2.7) which is a string of abstract unicode characters (aka UTF-32, though Python 3 adds fancy compression under the hood to reduce the actual memory footprint similar to UTF-8, perhaps even in a more general way).

There are essentially three ways of "interpreting" these bytes. You can look at the numeric value of an element, like this:

>>> ord(b'Hello'[0])  # Python 2.7 str
72
>>> b'Hello'[0]  # Python 3 bytestring
72

Or you can tell Python to emit one or more elements to the terminal (or a file, device, socket, etc.) as 8-bit characters, like this:

>>> print b'Hello'[0] # Python 2.7 str
H
>>> import sys
>>> sys.stdout.buffer.write(b'Hello'[0:1]) and None; print() # Python 3 bytestring
H

As Jack hinted, in this latter case it is your terminal interpreting the character, not Python.
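A Python 3 sketch of these first two views, using an in-memory io.BytesIO as a stand-in for the terminal or socket (that substitution is an assumption made so the example is self-contained): Python hands over the raw byte unchanged, and any "H" you see afterwards comes from whatever displays it.

```python
import io

data = b'Hello'

numbers = list(data)             # numeric view of every element
hex_view = data.hex()            # the same values written in hex

# A binary sink receives the byte as-is; Python does no interpreting.
sink = io.BytesIO()
sink.write(data[0:1])
written = sink.getvalue()

print(numbers, hex_view, written)
```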

Finally, as you have seen in your own research, you can also get Python to interpret a bytestring. For example, you can construct an abstract unicode object like this in Python 2.7:

>>> u1234 = unicode(b'\xe1\x88\xb4', 'utf-8')
>>> print u1234.encode('utf-8') # if terminal supports UTF-8
ሴ
>>> u1234
u'\u1234'
>>> print ('%04x' % ord(u1234))
1234
>>> type(u1234)
<type 'unicode'>
>>> len(u1234)
1
>>> 

Or like this in Python 3:

>>> u1234 = str(b'\xe1\x88\xb4', 'utf-8')
>>> print (u1234) # if terminal supports UTF-8 AND python auto-infers
ሴ
>>> u1234.encode('unicode-escape')
b'\\u1234'
>>> print ('%04x' % ord(u1234))
1234
>>> type(u1234)
<class 'str'>
>>> len(u1234)
1

(And I am sure that the amount of syntax churn between Python 2.7 and Python 3 around bytestrings, strings, and unicode had something to do with the continued popularity of Python 2.7. I suppose that when Python 3 was invented they didn't yet realize that everything would become UTF-8 and therefore all the fuss about abstraction was unnecessary.)

But unicode abstraction does not happen automatically if you don't want it to. The point of a bytestring is that you can directly get at the bytes. Even if your string happens to be a UTF-8 sequence, you can still access bytes in the sequence:

>>> len(b'\xe1\x88\xb4')
3
>>> b'\xe1\x88\xb4'[0]
'\xe1'

This works in both Python 2.7 and Python 3, with two differences. In Python 2.7 the type is str and indexing returns the one-character string '\xe1' shown above, while in Python 3 the type is bytes and indexing returns the integer 225 (slice with [0:1] to get b'\xe1').
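For Python 3 specifically, a minimal sketch of the same access: indexing a bytes object yields an int, slicing yields a bytes object, and the byte length stays separate from the character length of the decoded text.

```python
data = b'\xe1\x88\xb4'           # UTF-8 encoding of U+1234

first_as_int = data[0]           # indexing gives an int in Python 3
first_as_bytes = data[0:1]       # slicing gives a bytes object
text = data.decode('utf-8')      # interpret all three bytes as one character

print(first_as_int, first_as_bytes, len(data), len(text))
```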

You can also do other wonderful things with bytestrings, like knowing if they will fit in a reserved space within a file, sending them directly over a socket, calculating the HTTP content-length field correctly, and avoiding Python Bug 8260. In short, use bytestrings when your data is processed and stored in bytes.
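The content-length case can be sketched as follows. The response layout is a hand-written illustration, not taken from any particular server or library. The header has to count octets of the encoded body, which differs from the character count as soon as the text leaves ASCII.

```python
# 'Heävy Mëtal Ümlaüts', written with escapes to keep the source ASCII.
body_text = 'He\u00e4vy M\u00ebtal \u00dcmla\u00fcts'
body = body_text.encode('utf-8')

# Content-Length counts bytes, so measure the encoded body, not the str.
headers = (
    'HTTP/1.1 200 OK\r\n'
    'Content-Type: text/plain; charset=utf-8\r\n'
    'Content-Length: {}\r\n'
    '\r\n'
).format(len(body)).encode('ascii')

response = headers + body
print(len(body_text), len(body))
```

Using len(body_text) here would under-report the length by four bytes, one for each of the four accented letters.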

Bytes objects are immutable sequences of single bytes. The docs have a very good explanation of what they are and how to use them.
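A quick sketch of what "immutable" means in practice: item assignment on a bytes object raises TypeError, while bytearray, the mutable counterpart in the standard library, accepts it.

```python
frozen = b'hello'

try:
    frozen[0] = 72               # bytes does not support item assignment
    mutated = True
except TypeError:
    mutated = False

scratch = bytearray(frozen)      # mutable copy of the same bytes
scratch[0] = 72                  # 72 is the byte value shown as 'H'
result = bytes(scratch)

print(mutated, result)
```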
