
Changing buffer size to copy file in C

I have created a function that creates a copy of a file: read --> buffer --> write. I'm trying to increase the buffer size several times and see how it affects the time it takes to copy the file (about 50 MB).

# include <assert.h>
# include <stdio.h>
# include <stdlib.h>
# include <unistd.h>
# include <sys/types.h>
# include <sys/stat.h>
# include <sys/wait.h>
# include <string.h>
# include <fcntl.h>
# include <time.h>
// Copy the file referred to by in to out 
void copy (int in, int out, char *buffer, long long taille) {
  int t;

  while ((t = read(in, &buffer, sizeof taille))> 0)
    write (out, &buffer, t);


  if (t < 0)
    perror("read");
}

int main(){
  
  clock_t timing;  //to time 
  int buffer_size = 1;
  char * buffer = NULL;
  
  // allocating memory for the buffer
  buffer = malloc(sizeof(char)*buffer_size);
  // check that the allocation succeeded
  if (!buffer) {
    perror("malloc ini");
    exit(1);
  }

  // temporary buffer to be able to increase the size of the buffer 
  char * temp_buffer = NULL;

  // opening the files
  int fichier1 = open("grosfichier",O_RDONLY);
  int fichier2 = open("grosfichier_copy", O_WRONLY|O_CREAT);
  
  for (int i=0; buffer_size <= 1048576; i++){
    
    temp_buffer = realloc(buffer, buffer_size * sizeof(char));
    if(!temp_buffer) {
      perror("malloc temp_buffer");
      exit(1);
    }
    
    buffer = temp_buffer;

    timing = clock();
    copy(fichier1,fichier2, buffer, buffer_size); // copies standard input to standard output
    timing = clock() - timing;

    printf("%d, buffer size = %d, time : %ld\n", i, buffer_size, timing);
    remove("grosfichier_copie");

    buffer_size *= 2;
  }
  // free(temp_buffer);
  free(buffer);
  close(fichier1);
  close(fichier2);

  return 0;
}

The code runs and copies the file, but the timing doesn't seem to work properly:

0, buffer size = 1, time : 6298363
1, buffer size = 2, time : 1
2, buffer size = 4, time : 1
3, buffer size = 8, time : 1
4, buffer size = 16, time : 1
5, buffer size = 32, time : 1
6, buffer size = 64, time : 1
7, buffer size = 128, time : 1
8, buffer size = 256, time : 1
9, buffer size = 512, time : 1
10, buffer size = 1024, time : 1
11, buffer size = 2048, time : 1
12, buffer size = 4096, time : 1
13, buffer size = 8192, time : 1
14, buffer size = 16384, time : 1
15, buffer size = 32768, time : 0
16, buffer size = 65536, time : 1
17, buffer size = 131072, time : 4
18, buffer size = 262144, time : 1
19, buffer size = 524288, time : 2
20, buffer size = 1048576, time : 2
[Finished in 6.5s]
  1. Why doesn't it seem to copy after the first run? (according to the timing?)
  2. Am I using free appropriately? (I tried moving it in the loop, but it doesn't run)
  3. Am I passing the buffer appropriately to the function copy?

Thanks!

EDIT1: Thank you for all your comments. I have corrected the major flaws related to opening and closing the files within the loop, using the buffer appropriately, and the types of variables, as suggested. I'm getting results that are much more logical:

0, buffer size = 1, time : 8069679
1, buffer size = 2, time : 4082421
2, buffer size = 4, time : 2041673
3, buffer size = 8, time : 1020645
4, buffer size = 16, time : 514176
...

but I'm still struggling with handling write() errors appropriately.

Edit2: is this version of copy fine?

void copy (int in, int out, char *buffer, size_t taille) {
  ssize_t t;

  while ((t = read(in, buffer, taille))> 0){
    if (write (out, buffer, t)<0){
      perror("error writing");
    }
  }

  if (t < 0)
    perror("read");
}

Why doesn't it seem to copy after the first run? (according to the timing?)

Lots of possibilities. Firstly there are problems with your code. You don't seem to be rewinding or reopening the file to copy. After the first iteration, you are at end of file, so the remaining iterations copy 0 bytes.

Secondly, there are OS factors to consider. In particular, general purpose operating systems maintain an in memory cache of recently used disk contents. This means that the first time you read a file, it has to be pulled off disk, but on subsequent occasions, it may be already in RAM.

Am I using free appropriately? (I tried moving it in the loop, but it doesn't run)

Yes. Realloc will either reuse the same memory block if it is big enough or it will malloc a new block, copy the old block and free the old block. So do not ever attempt to realloc a block you have already freed.

Am I passing the buffer appropriately to the function copy?

Yes, but you are not using it appropriately within the function copy() as detailed by the comments you are receiving. Some of the problems within copy() are:

  • buffer is already a char*, so do not take its address to pass to read().
  • taille is the length of buffer, so pass it directly to read. Passing sizeof taille passes the size of the variable itself, not its content.
  • write need not necessarily write all the bytes in the buffer in one go. In that case, it will return a short count (unlikely to be an issue for a disk file).
  • write can also return -1 for an error. You need to handle that error.
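To make the last two points concrete, here is a minimal sketch of a helper that keeps calling write() until the whole chunk has been written and reports failure to the caller. The name write_all and the choice to retry on EINTR are my own assumptions, not something taken from the question's code.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* write_all is a hypothetical helper name, not part of the original code.
   It keeps calling write() until all len bytes have been written.
   Returns 0 on success, -1 on error (errno is left as set by write). */
int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0)
    {
        ssize_t n = write(fd, buf, len);
        if (n < 0)
        {
            if (errno == EINTR)
                continue;   /* interrupted before anything was written: retry */
            return -1;      /* real error, e.g. ENOSPC on a full filesystem */
        }
        buf += n;           /* short write: skip past what was written */
        len -= (size_t)n;
    }
    return 0;
}
```

Inside copy() you would then call write_all(out, buffer, (size_t)t) in place of the bare write() and stop copying if it returns -1.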

In your main program there are also issues.

  • As stated above: you either need to close and reopen the input file or rewind it to the beginning on each iteration of the loop.
  • remove does not do what you think, it merely removes the directory entry and decrements the file's reference count. The file will only physically go away when its reference count reaches zero. It won't reach zero while you still have an open file descriptor to it. So, you also need to close and reopen the output file or you'll just continue appending to an anonymous file that will be automatically deleted when your process exits.
  • One I didn't spot before: you should declare taille and buffer_size as size_t because that is the right sized type for the arguments to realloc , read (and write ). t should, however, be an ssize_t (signed size) because it can return either -1 or the count of bytes read/written.
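As a rough sketch of the first two points, each pass could start by rewinding the input descriptor and opening a fresh output file. The helper name prepare_pass is hypothetical, and using O_TRUNC to empty a leftover output file is my assumption; closing, removing and re-creating the file would work as well.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* prepare_pass is a hypothetical helper, not from the original code.
   It rewinds the already-open input descriptor to the start of the file
   and opens an empty output file for this pass.
   Returns the new output descriptor, or -1 on error. */
int prepare_pass(int in, const char *out_path)
{
    if (lseek(in, 0, SEEK_SET) == (off_t)-1)
        return -1;          /* could not rewind the input */

    /* O_TRUNC (an assumption here) discards whatever an earlier pass wrote. */
    return open(out_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
}
```

The caller is expected to close() the returned descriptor at the end of each pass, so that a later remove() really releases the file.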

Here's my modified version of your code, addressing most of the issues that I raised in comments, and most of those that other people raised.

# include <stdio.h>
# include <stdlib.h>
# include <unistd.h>
# include <fcntl.h>
# include <time.h>

size_t copy(int in, int out, char *buffer, size_t taille);

size_t copy(int in, int out, char *buffer, size_t taille)
{
    ssize_t t;
    ssize_t bytes = 0;

    while ((t = read(in, buffer, taille)) > 0)
    {
        if (write(out, buffer, t) != t)
            return 0;
        bytes += t;
    }

    if (t < 0)
        perror("read");
    return bytes;
}

int main(void)
{
    clock_t timing;
    int buffer_size = 1;
    char *buffer = malloc(sizeof(char) * buffer_size);

    if (!buffer)
    {
        perror("malloc ini");
        exit(1);
    }

    int fichier1 = open("grosfichier", O_RDONLY);
    if (fichier1 < 0)
    {
        perror("grosfichier");
        exit(1);
    }

    for (int i = 0; buffer_size <= 1048576; i++)
    {
        lseek(fichier1, 0L, SEEK_SET);
        char *temp_buffer = realloc(buffer, buffer_size * sizeof(char));
        if (!temp_buffer)
        {
            perror("malloc temp_buffer");
            exit(1);
        }
        int fichier2 = open("grosfichier_copy", O_WRONLY | O_CREAT, 0644);
        if (fichier2 < 0)
        {
            perror("open copy file");
            exit(1);
        }

        buffer = temp_buffer;

        timing = clock();
        size_t copied = copy(fichier1, fichier2, buffer, buffer_size);
        timing = clock() - timing;

        printf("%d, buffer size = %9d, time : %8ld (copied %zu bytes)\n",
               i, buffer_size, timing, copied);
        close(fichier2);
        remove("grosfichier_copy");

        buffer_size *= 2;
    }
    free(buffer);
    close(fichier1);

    return 0;
}

When I ran it (with two timing commands giving times), I got:

2018-01-15 08:00:27 [PID 43372] copy43
0, buffer size =         1, time : 278480098 (copied 50000000 bytes)
1, buffer size =         2, time : 106462932 (copied 50000000 bytes)
2, buffer size =         4, time : 53933508 (copied 50000000 bytes)
3, buffer size =         8, time : 27316467 (copied 50000000 bytes)
4, buffer size =        16, time : 13451731 (copied 50000000 bytes)
5, buffer size =        32, time :  6697516 (copied 50000000 bytes)
6, buffer size =        64, time :  3459170 (copied 50000000 bytes)
7, buffer size =       128, time :  1683163 (copied 50000000 bytes)
8, buffer size =       256, time :   882365 (copied 50000000 bytes)
9, buffer size =       512, time :   457335 (copied 50000000 bytes)
10, buffer size =      1024, time :   240605 (copied 50000000 bytes)
11, buffer size =      2048, time :   126771 (copied 50000000 bytes)
12, buffer size =      4096, time :    70834 (copied 50000000 bytes)
13, buffer size =      8192, time :    46279 (copied 50000000 bytes)
14, buffer size =     16384, time :    35227 (copied 50000000 bytes)
15, buffer size =     32768, time :    27996 (copied 50000000 bytes)
16, buffer size =     65536, time :    28486 (copied 50000000 bytes)
17, buffer size =    131072, time :    24203 (copied 50000000 bytes)
18, buffer size =    262144, time :    26015 (copied 50000000 bytes)
19, buffer size =    524288, time :    19484 (copied 50000000 bytes)
20, buffer size =   1048576, time :    28851 (copied 50000000 bytes)
2018-01-15 08:08:47 [PID 43372; status 0x0000]  -  8m 19s

real    8m19.351s
user    1m21.231s
sys 6m52.312s

As you can see, the 1-byte copying was dramatically awful and took something like 4 minutes of wall clock time to copy the data. Using 2 bytes halved that; 4 bytes halved it again, and the improvements kept going until about 32 KiB. After that, the performance was steady — and fast (the last few lines appeared in what seemed like under a second each, but I wasn't paying close attention). I'd put in alternative wall-clock timing using clock_gettime() (or gettimeofday() if that's not available) to time each cycle. I was worried at first with the lack of progress on the single byte copying, but a second terminal window confirmed the copy was growing, but oh so slowly!
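For the wall-clock timing mentioned above, something along these lines might do. It is only a sketch: the helper name wall_seconds is hypothetical, and the choice of CLOCK_MONOTONIC is my assumption. Note that clock() reports processor time used by the process, which is not the same as elapsed time.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes clock_gettime under strict -std= flags */
#include <time.h>

/* wall_seconds is a hypothetical helper name.
   Returns the current CLOCK_MONOTONIC reading in seconds,
   or -1.0 if clock_gettime() fails. */
double wall_seconds(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
        return -1.0;
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}
```

In the loop, take one reading before the call to copy() and one after it; the difference between the two is the elapsed time for that buffer size in seconds.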

It's been a while since this thread was active, but I thought I'd add to Andrew Henle's post.

To get a better idea of the real time involved in copying files, one could add an fsync(2) after the forever-loop exits and before copy() returns. fsync(2) will make sure all the data in the system's buffers has been sent to the underlying storage device. Note, however, that most disk drives have an onboard cache that can buffer writes, again masking the actual time it takes to write to the media.
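A minimal sketch of what that could look like, assuming a copy routine that returns the number of bytes copied. The name copy_and_sync and the -1 error convention are mine, not from the earlier posts.

```c
#include <sys/types.h>
#include <unistd.h>

/* copy_and_sync is a hypothetical name.
   Copies everything from in to out, then calls fsync() so the data has
   been handed to the storage device before the function returns.
   Returns the number of bytes copied, or -1 on any error. */
ssize_t copy_and_sync(int in, int out, char *buffer, size_t bufsize)
{
    ssize_t total = 0;
    ssize_t n;

    while ((n = read(in, buffer, bufsize)) > 0)
    {
        if (write(out, buffer, (size_t)n) != n)
            return -1;      /* write error or short write: give up */
        total += n;
    }
    if (n < 0)
        return -1;          /* read failed */

    if (fsync(out) < 0)     /* flush the system's buffers for this file */
        return -1;
    return total;
}
```

If the timing calls surround this function, the measured time includes the flush, which is the point of the suggestion.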

The vast majority of code that I write is for safety critical systems. Those are systems that, if they malfunction, can cause serious injury or death, or serious environmental damage. Such systems can be found in modern aircraft, nuclear power plants, medical devices, and automobile computers, just to name a few.

One of the rules that applies to source code for safety-critical systems is that loops must have a clear exit condition. By "clear", I mean that the exit condition must be expressed in the for, while, or do-while statement itself, and not somewhere within the compound statement.

I understand exactly what Andrew wrote. The intent is clear. It's concise. There's nothing wrong with it. And it's an excellent suggestion.

But (here's the "but"), the condition in the for appears at first glance to be infinite:

for (;; ) {... }

Why is this important? Source code validators would flag this as an infinite loop. Then you get dinged on your performance review, you don't get the raise you were expecting, your wife gets mad at you, files for a divorce, takes everything you own, and takes off with your divorce lawyer. And THAT's why it's important.

I'd like to suggest an alternate structure:

void copy( int in, int out, char *buffer, size_t bufsize )
{
    ssize_t bytes_read;

    switch (1) do
    {
        ssize_t bytes_written;

        bytes_written = write( out, buffer, bytes_read );
        if ( bytes_written != bytes_read )
        {
            // error handling code
        }

    default: // Loop entry point is here.
        bytes_read = read( in, buffer, bufsize );
    } while ( bytes_read > 0 );

    fsync(out);
}

I first ran across a switch-loop structure like this in the mid-80's. It was an effort to optimize the use of a pipelined architecture by avoiding departures from the execution of sequential instructions.

Suppose you had a simple routine that had to do a few things a great number of times. Copying data from one buffer to another is a perfect example.

char *srcp, *dstp;  // source and destination pointers
int count;          // number of bytes to copy (must be > 0)
...
while (count--)
{
    *dstp++ = *srcp++;
}
...

Simple enough. Right?

Downside: Every iteration around the loop, the processor has to jump back to the start of the loop, and in doing so, it dumps whatever is in the prefetch pipeline.

Using a technique called "loop unrolling", this can be rewritten to take advantage of a pipeline:

char *srcp, *dstp;  // source and destination pointers
int count;          // number of bytes to copy (must be > 0)
...
switch (count % 8) do
{
    case 0: *dstp++ = *srcp++; --count;
    case 7: *dstp++ = *srcp++; --count;
    case 6: *dstp++ = *srcp++; --count;
    case 5: *dstp++ = *srcp++; --count;
    case 4: *dstp++ = *srcp++; --count;
    case 3: *dstp++ = *srcp++; --count;
    case 2: *dstp++ = *srcp++; --count;
    case 1: *dstp++ = *srcp++; --count;
} while (count > 0);
...

Follow it through. The first statement executed is the switch . It takes the low three bits of count and jumps to the appropriate case label. Each case copies the data, increments the pointers, and decrements the count, then falls through to the next case .

When it gets to the bottom, the while condition is evaluated, and, if true, continues execution at the top of the do..while . It does not re-execute the switch .

The advantage is that the machine code produced is a longer series of sequential instructions, and therefore executes fewer jumps taking greater advantage of a pipelined architecture.
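For anyone who wants to try it, here is a self-contained sketch of the same idea written as a function. The name duff_copy, the size_t parameter and the early return for a zero count are my additions. In everyday code memcpy() is usually the better choice, and modern compilers can unroll simple loops on their own.

```c
#include <stddef.h>

/* duff_copy is a hypothetical name.
   Copies count bytes from src to dst, eight bytes per trip round the loop.
   Unlike the fragment above, a count of 0 is allowed. */
void duff_copy(char *dst, const char *src, size_t count)
{
    if (count == 0)
        return;                         /* the switch/do-while form needs count > 0 */

    size_t passes = (count + 7) / 8;    /* trips through the do-while body */

    switch (count % 8)                  /* jump into the middle for the remainder */
    {
    case 0: do { *dst++ = *src++;
    case 7:      *dst++ = *src++;
    case 6:      *dst++ = *src++;
    case 5:      *dst++ = *src++;
    case 4:      *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--passes > 0);
    }
}
```

With count = 10, the switch jumps to case 2 and copies two bytes, then the loop runs once more from the top and copies the remaining eight.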

As noted in the comments, this code is wrong:

void copy (int in, int out, char *buffer, long long taille) {
  int t;

  while ((t = read(in, &buffer, sizeof taille))> 0)
    write (out, &buffer, t);


  if (t < 0)
    perror("read");
}

First, a minor issue: both read() and write() return ssize_t , not int .

Second, you're ignoring the return value from write() , so you never really know how much gets written. This may or may not be a problem in your code, but you won't detect a failed copy from a filled-up filesystem, for example.

Now, for the real problems.

read(in, &buffer, sizeof taille)

&buffer is wrong. buffer is a char * - a variable in memory containing the address of a char buffer. That's telling read() to put the data it reads from the in file descriptor in the memory occupied by the buffer pointer variable itself, and not the actual memory that the address held in the buffer pointer variable refers to. You simply need buffer .

sizeof taille is also wrong. That's the size of the taille variable itself - as a long long it's likely 8 bytes.
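A tiny sketch to illustrate the point. The function name is hypothetical; it only returns what sizeof taille evaluates to for a long long parameter.

```c
#include <stddef.h>

/* size_passed_to_read is a hypothetical helper for illustration only.
   It returns the value that `sizeof taille` has inside the original copy():
   the size of the long long variable, not the number stored in it. */
size_t size_passed_to_read(long long taille)
{
    (void)taille;           /* the stored value plays no part in sizeof */
    return sizeof taille;
}
```

The result is the same for taille = 1 and taille = 1048576, so the original loop asked read() for the same small number of bytes on every call, however large the buffer was.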

If you're trying to copy the entirety of a file:

void copy( int in, int out, char *buffer, size_t bufsize )
{
    // why stuff three or four operations into
    // the conditional part of a while()??
    for ( ;; )
    {
        ssize_t bytes_read = read( in, buffer, bufsize );
        if ( bytes_read <= 0 )
        {
            break;
        }

        ssize_t bytes_written = write( out, buffer, bytes_read );
        if ( bytes_written != bytes_read )
        {
            // error handling code
        }
    }
}

It's that simple. The hard part is the error handling for any possible failure.

