简体   繁体   中英

General question about Binary files

I am a beginner and I am having trouble in grasping binary files. When I write to a file in binary mode (in python), I just write normal text. There is nothing binary about it. I know every file on my computer is a binary file but I am having trouble distinguishing between files written in binary mode by me and files like audio, video etc files that show up as gibberish if I open them in a text editor.

How are files that show up as gibberish created? Can you please give an example of a small file that is created like this, preferably in python?

I have a feeling I am asking a really stupid question but I just had to ask it. Googling around didn't help me.

Here's a literal answer to your question:

import struct
with open('gibberish.bin', 'wb') as f:
    f.write(struct.pack('<4d', 3.14159, 42.0, 123.456, 987.654))

That's packing those 4 floating point numbers into a binary format (little-endian IEEE 756 64-bit floating point).

Here's (some of) what you need to know:

Reading and writing a file in binary mode incurs no transformation on the data that you read or write. In text mode, as well as any decoding/encoding to/from Unicode, the data that you read or write is transformed according to the platform conventions for "text files".

Unix/Linux/Mac OS X: no change

older Mac: line separator is \\r , changed to/from Python standard \\n

Windows: line separator is \\r\\n , changed to/from \\n . Also (little known fact), Ctrl-Z aka \\x1a is interpreted as end-of-file, a convention inherited from CP/M which recorded file sizes as the number of 128-byte sectors used.

When I write to a file in binary mode (in python), I just write normal text.

You'll have to change your approach when you upgrade to Python 3.x:

>>> f = open(filename, 'wb')
>>> f.write("Hello, world!\n")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be bytes or buffer, not str
>>> f.write(b"Hello, world!\n")
14

But your question isn't really about binary files. It's about str .

In Python 2.x, str is a byte sequence that has an overloaded meaning:

  • A non-Unicode string, or
  • Raw binary data (like pixels in an image).

If you print the latter as it were the former, you get gibberish.

Python 3.x got rid of this double meaning by introducing a separate bytes type for binary data, leaving str unambiguously as a text string (and making it Unicode).

So-called "text" files are simply files that follow certain conventions: the bytes are usually a subset of all the possible bytes, generally ASCII or Unicode values, and are organized into "lines" with "line terminators". The standard line terminators vary by platform - Unix uses \\n , Mac \\r , and Windows \\r\\n - so part of the convention is to translate these on the fly. This works fine with text files, but will clobber other kinds of files, because an 0x0a ( \\n ) byte in a sound file or something won't take well to being converted to 0x0d 0x0a ( \\r\\n ). Of course, if you've only been using Unix, this won't have come up.

In Python 3, all strings are Unicode, and opening a file as text means you have to read and write Unicode strings, and perhaps specify an encoding (it defaults to UTF-8). Opening a file as binary means you have to use bytes objects, which are simple lists of 8-bit bytes and don't get encoded.

Does this clarify things?

Binary files are normally created when you try to encode objects. For example, you might have a Person object with properties like Name, Age, Height. If you were to write this file as text so that it can be read back in later, you might output something like this:

Name:Ralph
Age:25
Height:5'6"

But you can represent it more compactly in binary. In binary, you might just output the name, age and height one right after the other, and you'd have to read them back in in the exact same order because you no longer have these delimiters. In that case, your string would have to encoded with something like Ralph\\0 . The \\0 is the null character so that it knows where the string ends.

The 25 can be represented as just 2 characters in text/ASCII but if you tried putting two numbers side-by-side, like 25 and 26, you'd get 2526 and you wouldn't know where one ends and the next begins. These numbers are actually integers and be represented by 4 bytes. When you write a file as binary, you'd write out all 4 bytes, even if the left-most bits are all 0. That way it always knows exactly how much to read it. And so forth...

That's why "binary files" look like jibberish, because they've got all this extra information in them.

To generate these files, you'd have to encode or "pack" your data like John Machin suggests.

Maybe your are sending string in your binary file and your computer can decode it and show it to you? Try to write a file with random byte. Or you could show us your code so we can understand the problem.

I recommend using the codecs module of Python for writing text files (it allows you to set the related charset/encoding). For writing binary file use the standard file() method. On windows you may need use 'wb' or 'rb' for binary modes (does not matter on Unix).

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM