
Register width and parsing for a fast-loading file format

For roughly the past 20 years I've been working on a C++ program for 3D graphics that implements a METAFONT-like language. I have now started working on a format and functions for writing the data for the 3D objects to a binary file and then reading it back in. The format is intended for saving and fast loading of data that has already been calculated, to avoid recalculating it each time the program is run.

The syntax of the file format is intended to be a machine-oriented language that allows the highest possible efficiency, without any concern for being comfortable for people to read or write.

My question relates to the way data is read into registers. The architecture of my computer is x86_64, so I have 64-bit registers. Does it pay at all to read data into objects smaller than 64 bits, i.e., chars, ints, or floats? Isn't everything that's read read into a 64-bit register anyway? As I understand it, any unused bits of a register are set to 0, which is an extra step, and so less efficient than just reading a long int or a double in the first place. Is this correct, and does anyone have a suggestion on how I should proceed?

This is what I tried in response to Scheff's Cat's comment.

/* ttemp.c  */
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

void
write_uint(unsigned int i);

void
write_ulong(unsigned long int li);

int fd = 0;

int
main(int argc, char *argv[])
{
   printf("Entering ttemp.\n");
   fd = open("ttemp.output", O_WRONLY | O_CREAT | O_TRUNC, S_IRWXU);
   if (fd < 0)
   {
      perror("open");
      return 1;
   }
   printf("fd == %d\n", fd);
   write_uint(~0U);
   write_ulong(~0UL);   
   close(fd);
   printf("Exiting ttemp.\n");
   return 0;
}

void
write_uint(unsigned int i)
{
   write(fd, &i, sizeof i);    /* 4 bytes on this platform */
   return;
}

void
write_ulong(unsigned long int li)
{
   write(fd, &li, sizeof li);  /* 8 bytes on this platform */
   return;
}

Then I ran:

gcc -pg -o ttemp ttemp.c
./ttemp
gprof ttemp

This is the contents of ttemp.output, according to Emacs in Hexl mode, so the objects were obviously written to the output file:

00000000: ffff ffff ffff ffff ffff ffff            ............

This was the relevant portion of the output of gprof:

Call graph (explanation follows)
granularity: each sample hit covers 2 byte(s) no time propagated
index % time    self  children    called     name
                0.00    0.00       1/1           main [8]
[1]      0.0    0.00    0.00       1         write_uint [1]
-----------------------------------------------
                0.00    0.00       1/1           main [8]
[2]      0.0    0.00    0.00       1         write_ulong [2]
-----------------------------------------------

So, not very illuminating. My guess is that the zeroing of the registers is performed at the processor level, and any time it takes won't show up at the system-call level. However, I'm not a systems programmer, and my grasp of these topics isn't particularly firm.

Does it pay at all to read data into objects smaller than 64 bit, ie, chars, ints or floats?

This depends on the architecture. On most platforms it is very cheap: about one cycle, if not completely free, depending on the exact generated code. For more information, please read Should I keep using unsigned ints in the age of 64-bit computers?. Note that float-to-double conversion can be significantly slower, but it is still a matter of dozens of cycles on most mainstream x86 platforms (it can be very slow on embedded devices, though).

Isn't anything that's read read into a 64-bit register?

Actually, the processor does not read files in 64-bit blocks. Nearly all I/O operations are buffered (otherwise they would be very, very slow, due to the high latency of storage devices and even of system calls). For example, the system may fetch a 256 KiB buffer when you request only 4 bytes, because it knows that applications often read files contiguously, and because most storage devices are optimized for contiguous operations (the number of I/O operations per second is generally small). For more information about the latency of I/O operations compared to other operations, please read this (note that the numbers are approximations). Put shortly, the latency of an I/O operation is far larger than that of a type cast, so the latter should be completely negligible on most platforms (at least all mainstream ones). And even when reads and writes are buffered, the cost of the function call that reads from or writes into an internal buffer is still higher than the cost of a cast. Thus, you should not worry much about the cast in such a case.
