What do files actually contain, and how are they "read"? What is a "format" and why should I worry about them?

As it becomes ever easier to use computers in general and get into programming in particular, an increasing fraction of beginners seem to lack certain fundamental understanding that was once taken for granted in programming circles. Meanwhile as technology advances, the details of that understanding have grown more complex (I personally was programming before Unicode existed, let alone, say, JSON or XML). So, for the sake of having a solid reference, it seems apropos to ask:

What exactly is in a file, anyway? What do we mean when we say that we "open" and "read" a file - what are we getting out of it? I know the term "data", but just giving a name to something is not a real explanation.

More importantly, how do we make sense of data? If I try simply reading some data from a file and outputting it to the console, why does it often look like garbage? Why do some other files appear to have some text scattered among that garbage, while yet others seem to be mostly or completely text? Why isn't it sufficient to ask the program to read, say, an image file, in order to display the image? Again, I know the term "format", but this doesn't explain the concept. If we say, for example, that we make sense of data according to its format, then that only raises two more questions - how do we determine the format, and how does it actually help?

Data, bits and bytes

Everyone who has had to buy hardware, or arrange a network connection, should have some familiarity with the concept of a "bit" and of a "byte". They're used to measure the capacity of storage devices and transfer rates. In short, they measure data: the amount of data that can be stored on a disk, or the amount of data transferred along a cable (or via a wireless connection) per second.

Data is essentially information - a record of some kind of knowledge. The bit is the fundamental unit of information, representing the smallest possible amount of knowledge: the answer to a yes-or-no question, a choice between two options, a record of a decision between two alternatives. (There need to be at least two possibilities; with only one, no answer, choice or decision is necessary, and thus nothing is learned by seeing that single possibility arise.)

A byte is simply a grouping of bits in a standard size. Almost everyone nowadays defines a byte to mean 8 bits, mainly because all contemporary consumer hardware is designed around that concept. In some very specific technical contexts (such as certain C or C++ language standard documents), "byte" may have a broader meaning, and octet is used to be precise about 8-bit groupings. We will stick with "byte" here, because we don't need to worry about ancient hardware or idiosyncratic compiler implementations for now.

Data storage devices - both permanent ones like HDDs and SSDs, and temporary ones like RAM - use a huge number of individual components (depending on the device) to represent data, each of which can conceptually be in either of two states (we commonly use "on or off", "1 or 0" etc. as metaphors). Because there's a decision to be made between those two states, the component thus represents one bit of data. The data isn't a physical thing - it's not the component itself. It's the state of that component: the answer to the question "which of the two possible ways is this component configured right now?".

How data is made useful

It's easy to see how we can use a bit to represent a number, if there are only two possible numbers we are interested in. Suppose those numbers are 0 and 1; then we can ask, "is the number 1?", and according to the bit that tells us the answer to that question, we know which number is represented.

It turns out that in fact this is all we need in order to represent all kinds of numbers. For example, if we need to represent a number from {0, 1, 2, 3}, we can use two bits: one that tells us whether the represented number is in {0, 1} or {2, 3}, and one that tells us whether it's in {0, 2} or {1, 3}. If we can answer those two questions, we can identify the number. This technique generalizes, using base two arithmetic, to represent any integer: essentially, each bit corresponds to a value from the geometric sequence 1, 2, 4, 8, 16..., and then we just add up (implicitly) the values that were chosen by the bits. By tweaking this convention slightly, we can represent negative integers as well. If we let some bits correspond to binary fractions as well (1/2, 1/4, 1/8...), we can approximate real numbers (including the rationals) as closely as we want, depending on how many bits we use for the fractional part. Alternatively, we can just use separate groups of bits to represent the numerator and denominator of a rational number - or, for that matter, the real and imaginary parts of a complex number.
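Here is a minimal sketch of that place-value idea in Python (the helper name is made up purely for illustration):

    # Interpret a sequence of bits (most significant first) as a non-negative integer.
    def bits_to_int(bits):
        value = 0
        for bit in bits:        # each step doubles the running total and adds the new bit
            value = value * 2 + bit
        return value

    print(bits_to_int([1, 0, 1]))     # 1*4 + 0*2 + 1*1 = 5
    print(bits_to_int([1, 1, 1, 1]))  # 8 + 4 + 2 + 1 = 15
    print(int("101", 2))              # Python's built-in conversion agrees: 5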

Furthermore, once we can represent numbers, we can represent all kinds of answers to questions. For example, we can agree on a sequence of symbols that are used in text; and then, implicitly, a number represents the symbol at that position in the sequence. So we can use some number of bits to represent a symbol; and by representing individual symbols repeatedly, we can represent text.
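As a small illustration (the alphabet below is just one arbitrary agreed-upon sequence; Python's ord and chr perform the same lookup against the Unicode sequence):

    # Numbers name symbols by their position in an agreed-upon sequence.
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    numbers = [7, 4, 11, 11, 14]
    print("".join(alphabet[n] for n in numbers))   # -> hello

    # Unicode is just a much larger agreed-upon sequence of symbols.
    print(ord("A"), chr(65))                       # -> 65 A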

Similarly, we can represent the height of a sound wave at a given instant in time; by repeating this process a few tens of thousands of times per second, we can represent sound audible to humans.
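A rough sketch of that sampling idea (assuming the common rate of 44100 samples per second, and a 440 Hz tone chosen arbitrarily):

    import math

    sample_rate = 44100   # measurements per second; 44100 is a common choice
    frequency = 440.0     # pitch of the tone, in Hz
    duration = 0.01       # seconds of sound to describe

    # Each sample is just a number: the height of the wave at one instant.
    samples = [
        math.sin(2 * math.pi * frequency * (i / sample_rate))
        for i in range(int(sample_rate * duration))
    ]
    print(len(samples), samples[:3])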

Similarly, having studied how the human eye works, we find that we can analyze colours as combinations of three intensity values (i.e., numbers) representing "components" of the colour. By describing colours at many points a small distance apart (like with the sound wave, but in a two-dimensional grid), we can represent images. By considering images across time (a few tens of times per second), we can represent animations.
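A tiny sketch of that idea (the values are made up; 0..255 per component is a common but not universal convention):

    # A 2x2 image: each pixel is three intensity values (red, green, blue).
    image = [
        [(255, 0, 0), (0, 255, 0)],      # a red pixel and a green pixel
        [(0, 0, 255), (255, 255, 255)],  # a blue pixel and a white pixel
    ]
    red, green, blue = image[0][0]
    print(red, green, blue)              # -> 255 0 0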

And so on, and so on.

Choosing an interpretation

There's a problem here, though. All of this simply talks about possibilities for what data could represent. How do we know what it does represent?

Plainly, the raw data stored by a computer doesn't inherently represent anything specific. Because it's all in the same regular, sequence-of-bits form, nothing stops us from taking any arbitrary chunk of data and interpreting it by any of the schemes described above.

It just... isn't likely to appear like anything meaningful, that way.

However, the choice of interpretations is a choice... which means it can be encoded and recorded in this raw-data form. We say that such data is metadata: data that tells us about the meaning of other data. This could take many forms: the names of our files and the folder structure (telling us how those files relate to each other, and how the user intends to keep track of them); extensions on file names, special data at the beginning of files or other notes made within the file system (telling us what type of file it is, corresponding to a file format - keep reading); documentation (something that humans can read in order to understand how another file is intended to work); and computer programs (data which tells the computer what steps to take, in order to present the file's contents to the user).
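The "special data at the beginning of files" deserves a quick sketch. Many formats reserve their first few bytes for a recognizable signature (a "magic number"), and a program can peek at those bytes to guess the type. The file name below is just a placeholder; the signatures shown are the well-known ones for those formats:

    # Guess a file's type from its first bytes ("magic numbers").
    MAGIC = {
        b"\x89PNG\r\n\x1a\n": "PNG image",
        b"\xff\xd8\xff": "JPEG image",
        b"PK\x03\x04": "ZIP archive (also used by .docx, .jar, ...)",
    }

    def sniff(path):
        with open(path, "rb") as f:   # "rb": read raw bytes, no text decoding
            head = f.read(8)
        for signature, name in MAGIC.items():
            if head.startswith(signature):
                return name
        return "unknown"

    print(sniff("example.png"))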

What is a (file) format?

Quite simply, a format is the set of rules that describes a way to interpret some data (typically, the contents of a file). When we say that a file is "in" a particular format, we mean that it a) has a valid interpretation according to that format (not every possible chunk of data will meet the requirements, in general) and b) is intended to be interpreted that way.

Put another way: a format is the meaning represented by some metadata.

A format can be a subset or refinement of some other format. For example, JSON documents are also text documents, using UTF-8 encoding. The JSON format adds additional meaning to the text that was represented, by describing how specific text sequences are used to represent structured data. A programming language can also be thought of as this kind of format: it gives additional meaning to text, by explaining how that text can be translated into instructions a computer can follow. (A computer's "machine code" is also a kind of format that gets interpreted directly by the hardware rather than by a program.)
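Those layers can be seen directly (the byte string below is made up for illustration):

    import json

    # The same data, seen through successively richer formats:
    raw = b'{"answer": 42, "text": "caf\xc3\xa9"}'   # layer 1: just bytes

    text = raw.decode("utf-8")   # layer 2: UTF-8 says which text these bytes mean
    data = json.loads(text)      # layer 3: JSON says which structure that text means

    print(text)                  # {"answer": 42, "text": "café"}
    print(data["answer"] + 1)    # 43 - a number we can compute with, not just characters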

(Recall: we established that a computer program can be a kind of metadata, and that a programming language can be a kind of format, and that metadata represents a format. To close the loop: of course, one can have a computer program that implements a programming language - that's what a compiler is.)

A format can also involve multiple steps, explained by separate standards. For example, Unicode is the de facto standard text format, but it only describes how abstract numbers correspond to text symbols. It doesn't directly say how to convert the bits into numbers (and this does need to be specified; "treat each byte as a number from 0..255" a) would still be making a choice of many possible ways to do it; b) isn't really sufficient, because there are a lot more possible text symbols than that). To represent text, we also need an encoding, i.e. the rest of the rules for the data format, specifically to convert bits to numbers. UTF-8 is one such encoding, and has become dominant.
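The two steps are easy to see in Python, where str values are sequences of code points and bytes values are raw data:

    text = "héllo"

    # Unicode assigns each symbol a number (a "code point")...
    print([ord(ch) for ch in text])   # -> [104, 233, 108, 108, 111]

    # ...and an encoding such as UTF-8 decides which bytes represent those numbers.
    data = text.encode("utf-8")
    print(list(data))                 # -> [104, 195, 169, 108, 108, 111]
    # é (code point 233) became two bytes (195, 169): the encoding is a separate step.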

What actually happens when we read the file?

Raw data is transferred from the file on disk, into the program's memory.

That's it.

Some languages offer convenience functionality, for the common case of treating the data like text. This might mean doing some light processing on the data (because operating systems disagree about which text symbols, in what order, represent "the end of a line"), and loading the data into the language's built-in "string" data structure, using some kind of encoding. (Yes, even if the encoding is "each byte represents a number from 0 to 255 inclusive, which represents the corresponding Unicode code point", that is an encoding - even if it doesn't represent all text and thus isn't a proper Unicode encoding - and it is being used even if the programmer did nothing to specify it; there is no such thing as "plain text", and ignoring this can have all kinds of strange consequences.)
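In Python, for example, the difference shows up as binary mode versus text mode (the file name is a placeholder):

    # Binary mode: exactly the bytes that are on disk, no interpretation at all.
    with open("notes.txt", "rb") as f:
        raw = f.read()    # a bytes object

    # Text mode: the same bytes, plus a chosen encoding and newline handling.
    with open("notes.txt", "r", encoding="utf-8") as f:
        text = f.read()   # a str; "\r\n" line endings have been turned into "\n"

Being explicit about encoding="utf-8" avoids silently depending on whatever the platform's default encoding happens to be.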

But fundamentally, the reading is really just a transfer of data. Text conversion is often treated as special because, for a long time, programmers were sloppy about treating text properly as an interpretation of data; for decades there was an interpretation of data as text - one byte per text symbol (incidentally, "character" does not mean the same thing as a Unicode code point) - so well established that everyone started forgetting they were actually using it. Programmers forgot about this even though it only actually specifies what half the possible values of a byte mean and leaves the other half up to a local interpretation, and even though that scheme is still woefully inadequate for many world languages, such that programmers in many other countries came up with their own solutions. The solution - the Unicode standard, mentioned several times above - had its first release in 1991, but there are still a few programmers today blithely ignoring it.
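Guessing the interpretation wrong is exactly where "garbage" text comes from. A small demonstration (the word is arbitrary):

    data = "naïve".encode("utf-8")   # the bytes some other program wrote out

    print(data.decode("latin-1"))    # -> naÃ¯ve  (wrong guess about the encoding)
    print(data.decode("utf-8"))      # -> naïve   (the intended interpretation)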

But enough ranting.

How does interpreting a file work?

In order to display an image, render a web page, play sound or anything else from a file, we need to (a small sketch follows this list):

  • Have data that is actually intended to represent the corresponding thing;
  • Know the format that is used by the data to represent the thing;
  • Load the data (read the file, or read data from a network connection, or create the data by some other process);
  • Process the data according to the format.
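As a concrete sketch of the last three points, here is a rough example that reads the dimensions out of a BMP image file, assuming the common BITMAPINFOHEADER layout (the file name is a placeholder):

    import struct

    def read_bmp_size(path):
        with open(path, "rb") as f:   # load the raw data
            header = f.read(26)
        # apply the format's rules: the file must start with b"BM", and the
        # width and height are little-endian 32-bit integers at offsets 18 and 22.
        if header[:2] != b"BM":
            raise ValueError("not a BMP file")
        width, height = struct.unpack_from("<ii", header, 18)
        return width, height

    print(read_bmp_size("picture.bmp"))

Actually displaying the image would then mean continuing to apply the format's rules to the pixel data that follows the header.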

This happens for even the simplest cases, and it can involve multiple programs. For example, a simple command-line program that inputs text from the user (from the "standard input stream") and outputs text back (to the "standard output stream"), generally, is not actually causing the text to appear on screen, or figuring out what keys were pressed on the keyboard. Instead: the operating system interprets signals from the keyboard, in order to create readable data; after the program writes out its response to the input, another program (the terminal) will translate the text into pixel colour values (getting help from the operating system to choose images from a font); then the operating system will arrange to send the appropriate data to the monitor (according to the terminal window's position etc.).
