
Why is a binary file not a text file, while all text files are binary files?

What is the deciding factor for classifying a file as a binary or a text file?

E.g., consider the C program below:

  1. Create file in binary mode
  2. Write two integers into file "binary.txt".

NOTE: Before running the program, make sure binary.txt doesn't exist.

Observation:

File created "binary.txt" with contents TEXTFILE

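#include <stdio.h>
#include <stdlib.h>   /* exit */

int main(void)
{
   /* On a little-endian machine with 4-byte int, these two values are
      stored as the bytes "TEXT" and "FILE". */
   int arr[2] = {1415071060, 1162627398};
   FILE *fp = fopen("binary.txt", "wb");

   if (fp == NULL)
   {
       printf("Error opening file\n");
       exit(1);
   }
   fwrite(arr, sizeof(arr), 1, fp);
   fclose(fp);
   return 0;
}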

However, only the creator knows that it was created in binary mode, so it should be called a binary file.

Anyone who opens the file "binary.txt" will think it is a text file.

What should a general user call this file: binary or text?

@JohnBollinger summarized it best in a comment.

text vs. binary is not a fundamental file characteristic on modern operating systems, but rather a differentiation between how files are interpreted.

Let's say a file contains four bytes with the following hex values of the bytes:

0x41 0x42 0x43 0x44

If you interpret those bytes as characters in a system that uses ASCII encoding, you will get the characters ABCD.

If you treat those bytes as a 4-byte integer, you will get the value 0x41424344 (1094861636 in decimal) in a big endian system and 0x44434241 (1145258561 in decimal) in a little endian system.
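The three readings of those four bytes can be reproduced with a small sketch. The helper names (read_u32_be, read_u32_le, bytes_as_text) are made up for this illustration, and the text reading assumes an ASCII execution character set.

```c
#include <stddef.h>
#include <stdint.h>

/* Combine four bytes into a 32-bit value with the first byte as the most
   significant one, which is how a big-endian machine reads them. */
uint32_t read_u32_be(const unsigned char b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* Same bytes, first byte least significant (little-endian reading). */
uint32_t read_u32_le(const unsigned char b[4])
{
    return ((uint32_t)b[3] << 24) | ((uint32_t)b[2] << 16) |
           ((uint32_t)b[1] << 8)  |  (uint32_t)b[0];
}

/* Copy the same four bytes into a NUL-terminated string so they can be
   printed as characters. */
void bytes_as_text(const unsigned char b[4], char out[5])
{
    for (size_t i = 0; i < 4; i++)
        out[i] = (char)b[i];
    out[4] = '\0';
}
```

Given the bytes 0x41 0x42 0x43 0x44, bytes_as_text yields ABCD, read_u32_be yields 1094861636 and read_u32_le yields 1145258561. The bytes never change, only the reading does.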

As far as the computer is concerned, it's all binary. As to what the bytes mean, it's all a matter of interpretation.

On modern operating systems, there is no distinction at the file system level between text files and binary files. On legacy systems, the C library implements a series of tricks to translate newlines between OS-specific representations (such as 0x0D 0x0A) and the single-byte representation '\n' seen by a C program reading the file in text mode. This compatibility layer must not be used when dealing with actual binary contents, for which the b option must be used in fopen().
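One way to observe the newline translation is to write the same two characters in both modes and count the bytes that end up on disk. This is only a sketch: stored_size is a hypothetical helper, and the text-mode result depends on the platform.

```c
#include <stdio.h>

/* Write the two characters 'a' and '\n' using the given fopen mode, then
   count the bytes actually stored by reading the file back in binary
   mode. Returns -1 on any I/O error. */
long stored_size(const char *path, const char *mode)
{
    FILE *fp = fopen(path, mode);
    if (fp == NULL)
        return -1;
    if (fputs("a\n", fp) == EOF) {
        fclose(fp);
        return -1;
    }
    if (fclose(fp) != 0)
        return -1;

    fp = fopen(path, "rb");          /* binary mode: no translation */
    if (fp == NULL)
        return -1;
    long count = 0;
    while (fgetc(fp) != EOF)
        count++;
    fclose(fp);
    return count;
}
```

With mode "wb" the result is 2 everywhere. With mode "w" it is also 2 on Unix-like systems, while a Windows C library typically stores 3 bytes (0x61 0x0D 0x0A).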

Older operating systems used to have different representations for text and binary files, but most of these are obsolete nowadays.

Conversely, many file systems keep track of executable files with some specific information such as mode bits on Unix FS. These executable files can be binary, containing one form or another of executable code, while others are text files containing scripts.

In your example, whether the file should be seen as binary or text is a matter of intent. If the creator of the file intended for it to be read as binary, naming it binary.txt is confusing, as the filename extension .txt is routinely used to indicate generic text files. sample.bin would be much more obvious.

How to interpret the contents of a file matters to programmers and casual users alike: on legacy systems, loading and saving a file as text may change its contents, unless you use tools that are scrupulous about preserving contents.

For example, qemacs, a programmer's editor inspired by emacs, makes extensive efforts upon loading a file to determine the best mode for displaying and editing the contents:

  • binary vs. text mode (defaulting to hex display for binary)
  • line termination convention
  • character encoding
  • programming language or other specific content sensitive display options...

If the file is written back without modifications, the contents are preserved so binary files that happen to have textual contents are unmodified. Otherwise, the above tests determine the correct conventions for encoding new contents.

This question has changed substantially since it was first posed. In particular, the term "executable" has been removed from the discussion.

Current question:

Only the creator knows that it was created in binary mode, so it should be called a binary file.

The creator has not only created the file but also made it available. If the purpose and format were not communicated, then that is a failure somewhere.

Anyone who opens the file "binary.txt" will think it is a text file.

People would probably think so, but they still can't properly process it as a text file without knowing the character encoding. Again, a communications failure. A guessed-at character encoding that works today might not work for the contents of the file tomorrow.


Answer to original question:

Yes, it's all a matter of interpretation. Interpretation requires context and metadata.

In addition to what others have said,

  • A file cannot be text unless you know which character encoding was used to write it (and must be used to read it). Common file systems do not store this knowledge. People dealing in text files must pass this essential metadata on to programs and other people.

  • A file cannot be executable unless you know which interpreter program or program loader to load it with. Systems have schemes for this:

    • Unix-like: Set the eXecutable flag on the file in the file system. Then if it is a "script" and an interpreter program is required for it, the script can have a shebang #! line stating the program to run it in.
    • Windows: Use a particular file extension listed in the PATHEXT environment variable. Example: PATHEXT=.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC The extension would be registered with an Open verb indicating how to "open" or start it.
    • Finally, the program could have a "file signature" that indicates which program loader to run it with.
  • A file can be called binary whether or not you have metadata to call it text or executable or both.
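The "file signature" idea can be sketched as a check on the first few bytes of a file. The enum and function names below are invented for this example, and it recognizes only three well-known prefixes: the #! of a script, the 0x7F 'E' 'L' 'F' bytes of an ELF file, and the MZ bytes that start DOS/Windows executables. A real program loader validates far more than this.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical classification of a file by its leading bytes. */
typedef enum { SIG_UNKNOWN, SIG_SHEBANG, SIG_ELF, SIG_MZ } file_sig;

/* Inspect the first bytes of a buffer and report which known signature,
   if any, it starts with. */
file_sig sniff_signature(const unsigned char *buf, size_t len)
{
    /* "\x7f" and "ELF" are separate literals so that 'E' is not parsed
       as part of the hex escape. */
    if (len >= 4 && memcmp(buf, "\x7f" "ELF", 4) == 0)
        return SIG_ELF;
    if (len >= 2 && memcmp(buf, "MZ", 2) == 0)
        return SIG_MZ;
    if (len >= 2 && memcmp(buf, "#!", 2) == 0)
        return SIG_SHEBANG;
    return SIG_UNKNOWN;
}
```

A buffer beginning with #!/bin/sh is reported as SIG_SHEBANG even though every byte in it is printable text, which illustrates that "executable" is metadata layered on top of the bytes.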

I think one has to distinguish "text", "binary", and "executable":

"Text" usually means a file containing only human-readable characters (letters, digits, punctuation, tabs and CR/LF), i.e. something that you can open with a text editor without seeing weird stuff.

The meaning of "binary" often depends on the context. If the context is, for example, the open mode used in file processing, then "binary" means that each byte is read in as is, whereas "text" means that platform-specific conversions, like automatically converting a "\r\n" into a single "\n", apply (cf., for example, FILE *fp=fopen("c:\\test.txt", "rb") versus FILE *fp=fopen("c:\\test.txt", "rt"), where "rt" is a Microsoft extension and plain "r" is the standard spelling of text mode). If the context is the distribution format of programs, then "binary" often means "precompiled for a particular platform". This is in contrast to source code distributions, where the files are typically "text files".

The meaning of "executable" is that the file content is interpreted by the operating system as an executable program. This often means a file containing machine code instructions, which include non-readable characters as well, so such files are usually not "text files" and are usually not interpreted as text. In a broader sense, shell scripts are also "executables", as they contain instructions interpreted by the respective shell. These instructions are written as text and can be opened in a text editor.

From these perspectives, I think that "text" and "binary" are opposite terms, whereas "executable" is orthogonal to both.

I think you are asking two different questions.

File contents

If the file contains textual data, i.e., lines of characters delimited by newlines, then it is a text file.

Otherwise it is presumed to contain data in some form other than strictly character data, such as binary integers, floating-point numbers, image pixels, music samples, structured binary data, etc., which means that it is a binary file, i.e., a non-text file.

There are many other text file formats, such as .xml, .html, .csv, as well as programming language source code files. These are strictly character text files, but generally have some kind of internal structure based on the syntax of their contents.

That being said, all text files are inherently binary files, in the sense that the characters, newlines, and so forth comprising the textual data in the file are nothing more than a stream of bytes at the lowest level.

File name

Specifically, the filename extension or suffix. By convention, files with a .txt extension are presumed to contain text data, i.e., lines of character data delimited by some kind of newline sequences.

A different filename extension like .bin or .exe (or a hundred others) indicates some kind of binary data file, usually structured in some way. By convention, .bin indicates binary data with no specific format, i.e., just a stream of bytes.

In addition, there are files having an extension like .doc or .pdf (or dozens of others), indicating a word processing document file. These files also contain character text data, but it is typically stored in some kind of strictly binary format that is specific to the word processing software used to create it.

In general, a file is just a sequence of bytes.

For any machine you're likely to use, bytes are 8 bits. So each byte has 256 possible values.

Confining our attention for the moment to old-fashioned ASCII, something like 95 of those byte values are ordinary, printing characters: letters, digits, punctuation. There are a few more characters which may also appear in text files: let's say tab, carriage return, linefeed, and form feed ('\t', '\r', '\n', and '\f').

If every one of the bytes in a file is one of those characters (printing or the few whitespace ones above), the file is a text file.

If any of the bytes in a file is other than one of those characters, the file is not a text file.

If the file is intended for human consumption, its creator will have used only the ordinary printing characters, and it will be a text file.

If the file contains arbitrary data, each byte might have any of its 256 possible values, and the file will be a binary file. It's very likely that at least one of the bytes in such a file will be something other than an ordinary printing character. (Even if all the arbitrary bytes just happen to be in the set of ordinary printable characters, they probably won't mean much, and we might still think of it as a binary file.)

Anyway, that's why every text file is theoretically a binary file, but not every binary file is a text file.

As a practical example, try this program:

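#include <stdio.h>
#include <stdlib.h>   /* exit */

int main(void)
{
    short int x = 906;
    FILE *fp1 = fopen("textfile.txt", "w");    /* text mode */
    FILE *fp2 = fopen("binaryfile.bin", "wb"); /* binary mode */
    if (fp1 == NULL || fp2 == NULL) exit(1);
    fprintf(fp1, "%d\n", x);          /* writes the decimal digits of x */
    fwrite(&x, sizeof(x), 1, fp2);    /* writes the raw bytes of x */
    fclose(fp1);
    fclose(fp2);
    return 0;
}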

If you compile and run this program, you should find that it creates a text file textfile.txt containing the string 906. But if you inspect the file binaryfile.bin, you should find that it contains just two bytes (assuming a 2-byte short), with the hexadecimal values 03 and 8A, in an order that depends on the machine's byte order. Neither of those is an ordinary printing character, so it's a binary file.

Now, try changing the program slightly, setting

short int x = 12345;

If you run it again, textfile.txt will now contain the string 12345, as expected. binaryfile.bin will again contain two bytes, this time with hex values 30 and 39. But if you try printing binaryfile.bin, you'll probably see the characters 0 and 9 (as 09 or 90, depending on byte order), because 0x30 and 0x39 are the ASCII codes for the characters 0 and 9.
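The rule used in this answer (printing ASCII characters plus tab, carriage return, linefeed and form feed) can be written as a small classifier. This is a minimal sketch assuming ASCII. The function names are made up for the example, and treating an empty buffer as text is a choice of this sketch, not something the answer specifies.

```c
#include <stddef.h>

/* A byte counts as "text" if it is one of the 95 printing ASCII
   characters (0x20..0x7E) or one of tab, CR, LF, form feed. */
int is_text_byte(unsigned char c)
{
    return (c >= 0x20 && c <= 0x7E) ||
           c == '\t' || c == '\r' || c == '\n' || c == '\f';
}

/* Returns 1 if every byte passes is_text_byte, 0 as soon as one does
   not. An empty buffer is reported as text. */
int looks_like_text(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (!is_text_byte(buf[i]))
            return 0;
    return 1;
}
```

Applied to the two runs above, the bytes 03 8A are classified as non-text, while 30 39 pass as text even though they were written as a binary short. That is the "arbitrary bytes that just happen to be printable" case.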

NOTE: This discussion is limited to ASCII (multi-byte character sets and other encodings are set aside to avoid unnecessary confusion).


Let us understand the difference between a string and an array of characters.

In a byte of 8 bits, we can store 0 to 255 if it is unsigned, or -128 to +127 if it is signed.

Taking both views together, the values a byte (8 bits) can represent span -128 to 255. The range of ASCII characters is 0 to 127.

Given a character array a[10], if any of the bytes a[0] to a[9] has a value outside the ASCII range, then it is not a string; it is just an array of characters. If all of the bytes fall within the ASCII range (0 to 127), then it is a string.

In summary, for an array of characters the values can be anywhere in -128 to 255.

The important conclusion here is that, since the ASCII range (0 to 127) is a proper subset of -128 to 255, all strings can be called arrays of characters.

Now let us apply the above definition to binary file vs. text file.

If all bytes in a file are in the ASCII range (0 to 127), it should be called a text file.

If any of them falls outside this range, i.e. in (-128 to -1) or (128 to 255), then it is a binary file.

In summary, since the ASCII range 0 to 127 is a proper subset of (-128 to 255), all text files are binary files.

If a file has at least one byte from (-128 to -1) or (128 to 255), it cannot be a text file, only a binary file.
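The two "out of range" intervals describe the same bit patterns seen through a signed or an unsigned char, so a check for this definition only needs one comparison after converting to unsigned char. The sketch below uses a hypothetical function name, all_ascii, and follows this answer's definition, under which the value 0 also counts as being in range.

```c
#include <stddef.h>

/* Returns 1 if every byte is in the ASCII range 0..127, else 0.
   The cast makes the test independent of whether plain char is signed:
   a byte seen as -128..-1 through a signed char becomes 128..255. */
int all_ascii(const char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char u = (unsigned char)buf[i];
        if (u > 127)
            return 0;
    }
    return 1;
}
```

For the eight bytes TEXTFILE the function returns 1. A buffer containing a byte such as 0xE9 makes it return 0.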

I have not checked the standards to see whether any characters in the ASCII range get special treatment. But in summary, I think I have made the distinction between text file and binary file clear.

Hope this helps
