
General question about Binary files

I am a beginner and I am having trouble grasping binary files. When I write to a file in binary mode (in Python), I just write normal text. There is nothing binary about it. I know every file on my computer is a binary file, but I am having trouble distinguishing between files written in binary mode by me and files like audio or video files that show up as gibberish if I open them in a text editor.

How are files that show up as gibberish created? Can you please give an example of a small file that is created like this, preferably in Python?

I have a feeling I am asking a really stupid question but I just had to ask it. Googling around didn't help me.

Here's a literal answer to your question:

import struct
with open('gibberish.bin', 'wb') as f:
    f.write(struct.pack('<4d', 3.14159, 42.0, 123.456, 987.654))

That's packing those 4 floating point numbers into a binary format (little-endian IEEE 754 64-bit floating point).
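To see both sides of this, here is a minimal sketch that writes the same four numbers, reads the raw bytes back, and unpacks them again. The temporary directory and the variable names are only for illustration and are not part of the answer above.

```python
import os
import struct
import tempfile

values = (3.14159, 42.0, 123.456, 987.654)

# Write to a throwaway directory so the sketch leaves nothing behind in cwd.
path = os.path.join(tempfile.mkdtemp(), "gibberish.bin")

with open(path, "wb") as f:
    f.write(struct.pack("<4d", *values))

with open(path, "rb") as f:
    raw = f.read()

print(len(raw))   # 32: four doubles, 8 bytes each
print(raw[:8])    # the first double as raw bytes; this is the "gibberish"

# The same format string turns the bytes back into numbers.
unpacked = struct.unpack("<4d", raw)
print(unpacked)
```

A text editor shows nonsense for this file because it tries to interpret those 32 bytes as characters, while only the format string '<4d' says what they really mean.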

Here's (some of) what you need to know:

Reading and writing a file in binary mode incurs no transformation on the data that you read or write. In text mode, in addition to any decoding/encoding from/to Unicode, the data that you read or write is transformed according to the platform's conventions for "text files":

Unix/Linux/Mac OS X: no change

older Mac: line separator is \r, changed to/from the Python standard \n

Windows: line separator is \r\n, changed to/from \n. Also (little known fact), Ctrl-Z aka \x1a is interpreted as end-of-file, a convention inherited from CP/M, which recorded file sizes as the number of 128-byte sectors used.
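The line-separator translation can be observed on any platform with Python 3 by asking for it explicitly. This is only a sketch: the newline="\r\n" argument stands in for what Windows does by default, and the file name is made up.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "lines.txt")

# Text mode: every "\n" written is translated to "\r\n" on the way out.
# newline="\r\n" forces the Windows convention regardless of platform.
with open(path, "w", newline="\r\n") as f:
    f.write("one\ntwo\n")

# Binary mode: no transformation, so we see exactly what is on disk.
with open(path, "rb") as f:
    raw = f.read()

# Text mode with default newline handling: "\r\n" is turned back into "\n".
with open(path, "r") as f:
    text = f.read()

print(raw)
print(repr(text))
```

The program wrote 8 characters but the file holds 10 bytes, and reading in text mode hides the difference again.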

"When I write to a file in binary mode (in python), I just write normal text."

You'll have to change your approach when you upgrade to Python 3.x:

>>> f = open(filename, 'wb')
>>> f.write("Hello, world!\n")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be bytes or buffer, not str
>>> f.write(b"Hello, world!\n")
14

But your question isn't really about binary files. It's about str.

In Python 2.x, str is a byte sequence that has an overloaded meaning:

  • A non-Unicode string, or
  • Raw binary data (like pixels in an image).

If you print the latter as if it were the former, you get gibberish.

Python 3.x got rid of this double meaning by introducing a separate bytes type for binary data, leaving str unambiguously as a text string (and making it Unicode).
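A small Python 3 sketch of that split, using made-up sample values: text becomes bytes only through an explicit encode(), and raw binary data shown as if it were text comes out as gibberish.

```python
# str is text: a sequence of Unicode code points.
text = "na\u00efve caf\u00e9"

# bytes is binary data: turning text into bytes needs an explicit encoding.
data = text.encode("utf-8")

print(type(text).__name__, len(text))   # str 10
print(type(data).__name__, len(data))   # bytes 12 (two characters need 2 bytes each)

# Raw binary data that was never meant to be text (arbitrary sample values).
pixels = bytes([0, 255, 137, 80, 78, 71])

# Forcing it into text "works" with latin-1, but the result is meaningless.
gibberish = pixels.decode("latin-1")
print(repr(gibberish))

# Mixing the two types is an error rather than a silent mix-up.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

The latin-1 codec is used here only because it maps every byte value to some character, so the decode never fails.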

So-called "text" files are simply files that follow certain conventions: the bytes are usually a subset of all the possible bytes, generally ASCII or Unicode values, and are organized into "lines" with "line terminators". The standard line terminators vary by platform - Unix uses \n, old Mac \r, and Windows \r\n - so part of the convention is to translate these on the fly. This works fine with text files, but will clobber other kinds of files, because a 0x0a (\n) byte in a sound file or something won't take well to being converted to 0x0d 0x0a (\r\n). Of course, if you've only been using Unix, this won't have come up.

In Python 3, all strings are Unicode, and opening a file as text means you have to read and write Unicode strings, and perhaps specify an encoding (the default depends on the platform and locale; it is often, but not always, UTF-8). Opening a file as binary means you have to use bytes objects, which are simple sequences of 8-bit bytes and don't get encoded.
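As a rough illustration of what the encoding does to a text file, the sketch below writes the same five characters with three encodings and then measures each file in binary mode. The file names and the choice of encodings are arbitrary.

```python
import os
import tempfile

text = "caf\u00e9\n"          # 5 characters: c, a, f, e-acute, newline
folder = tempfile.mkdtemp()
sizes = {}

for enc in ("utf-8", "latin-1", "utf-16-le"):
    path = os.path.join(folder, enc + ".txt")

    # Text mode: str in, encoded bytes out. newline="\n" turns off
    # line-ending translation so only the encoding affects the size.
    with open(path, "w", encoding=enc, newline="\n") as f:
        f.write(text)

    # Binary mode: count the bytes that actually landed on disk.
    with open(path, "rb") as f:
        sizes[enc] = len(f.read())

print(sizes)
```

Passing encoding= explicitly avoids depending on whatever the platform default happens to be.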

Does this clarify things?

Binary files are normally created when you try to encode objects. For example, you might have a Person object with properties like Name, Age, Height. If you were to write this object as text so that it can be read back in later, you might output something like this:

Name:Ralph
Age:25
Height:5'6"

But you can represent it more compactly in binary. In binary, you might just output the name, age and height one right after the other, and you'd have to read them back in the exact same order because you no longer have these delimiters. In that case, your string would have to be encoded with something like Ralph\0. The \0 is the null character, so the reader knows where the string ends.

The 25 can be represented as just 2 characters in text/ASCII, but if you tried putting two numbers side by side, like 25 and 26, you'd get 2526 and you wouldn't know where one ends and the next begins. These numbers are actually integers and can be represented by 4 bytes each. When you write a file as binary, you write out all 4 bytes, even if the left-most bits are all 0. That way the reader always knows exactly how many bytes to read. And so forth...

That's why "binary files" look like gibberish: a text editor tries to display these packed bytes as characters, even though they were never meant to be text.

To generate these files, you'd have to encode or "pack" your data like John Machin suggests.
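Here is one way the Person record described above could be packed, as a sketch only. The layout (UTF-8 name, a \0 terminator, then two 4-byte little-endian integers) and the helper names pack_person and unpack_person are assumptions made for this example, with the height stored in inches.

```python
import struct


def pack_person(name, age, height_inches):
    """Name as UTF-8 bytes plus a NUL terminator, then two 4-byte ints."""
    return name.encode("utf-8") + b"\x00" + struct.pack("<2i", age, height_inches)


def unpack_person(record):
    """Read the fields back in exactly the order they were written."""
    end = record.index(b"\x00")              # the NUL marks the end of the name
    name = record[:end].decode("utf-8")
    age, height_inches = struct.unpack_from("<2i", record, end + 1)
    return name, age, height_inches


record = pack_person("Ralph", 25, 66)        # 5'6" is 66 inches
print(record)
print(len(record))                           # 14: 5 + 1 + 4 + 4 bytes
print(unpack_person(record))

# Fixed-width integers avoid the "2526" ambiguity: 25 and 26 are 4 bytes each.
pair = struct.pack("<2i", 25, 26)
print(pair)
```

Opened in a text editor, only "Ralph" is readable; the two integers show up as control characters and whatever letters their byte values happen to match.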

Maybe you are writing a string to your binary file, and your computer can decode it and show it to you? Try writing a file with random bytes. Or you could show us your code so we can understand the problem.
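A quick way to try the random-bytes suggestion, assuming Python 3 (the file name random.bin and the size of 64 bytes are arbitrary):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "random.bin")

# 64 bytes from the operating system's random source.
data = os.urandom(64)

with open(path, "wb") as f:
    f.write(data)

with open(path, "rb") as f:
    raw = f.read()

# Binary mode returns exactly what was written.
print(raw == data)

# Try to treat the bytes as plain ASCII text; random data almost always fails.
try:
    raw.decode("ascii")
    looks_like_text = True
except UnicodeDecodeError:
    looks_like_text = False
print(looks_like_text)
```

Opening random.bin in a text editor should give the same kind of gibberish as an audio or video file.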

I recommend using Python's codecs module for writing text files (it lets you set the charset/encoding). For writing binary files, use the standard file() function (Python 2 only; in Python 3 it is open()). On Windows you may need to use 'wb' or 'rb' for binary mode (it does not matter on Unix).
