
Changing buffer size to copy file in C

I have created a function that creates a copy of a file: read --> buffer --> write. I'm trying to increase the buffer size several times and see how it affects the time it takes to copy the file (about 50 MB).

# include <assert.h>
# include <stdio.h>
# include <stdlib.h>
# include <unistd.h>
# include <sys/types.h>
# include <sys/stat.h>
# include <sys/wait.h>
# include <string.h>
# include <fcntl.h>
# include <time.h>
// Copy the file referred to by in to out 
void copy (int in, int out, char *buffer, long long taille) {
  int t;

  while ((t = read(in, &buffer, sizeof taille))> 0)
    write (out, &buffer, t);


  if (t < 0)
    perror("read");
}

int main(){
  
  clock_t timing;  //to time 
  int buffer_size = 1;
  char * buffer = NULL;
  
  // allocating memory for the buffer
  buffer = malloc(sizeof(char)*buffer_size);
  // test mémoire
  if (!buffer) {
    perror("malloc ini");
    exit(1);
  }

  // temporary buffer to be able to increase the siwe of the buffer 
  char * temp_buffer = NULL;

  // opening the files
  int fichier1 = open("grosfichier",O_RDONLY);
  int fichier2 = open("grosfichier_copy", O_WRONLY|O_CREAT);
  
  for (int i=0; buffer_size <= 1048576; i++){
    
    temp_buffer = realloc(buffer, buffer_size * sizeof(char));
    if(!temp_buffer) {
      perror("malloc temp_buffer");
      exit(1);
    }
    
    buffer = temp_buffer;

    timing = clock();
    copy(fichier1,fichier2, buffer, buffer_size); //recopie l'entree std dans la sortie std
    timing = clock() - timing;

    printf("%d, buffer size = %d, time : %ld\n", i, buffer_size, timing);
    remove("grosfichier_copie");

    buffer_size *= 2;
  }
  // free(temp_buffer);
  free(buffer);
  close(fichier1);
  close(fichier2);

  return 0;
}

The code runs and copies the file, but the timing doesn't seem to work properly:

0, buffer size = 1, time : 6298363
1, buffer size = 2, time : 1
2, buffer size = 4, time : 1
3, buffer size = 8, time : 1
4, buffer size = 16, time : 1
5, buffer size = 32, time : 1
6, buffer size = 64, time : 1
7, buffer size = 128, time : 1
8, buffer size = 256, time : 1
9, buffer size = 512, time : 1
10, buffer size = 1024, time : 1
11, buffer size = 2048, time : 1
12, buffer size = 4096, time : 1
13, buffer size = 8192, time : 1
14, buffer size = 16384, time : 1
15, buffer size = 32768, time : 0
16, buffer size = 65536, time : 1
17, buffer size = 131072, time : 4
18, buffer size = 262144, time : 1
19, buffer size = 524288, time : 2
20, buffer size = 1048576, time : 2
[Finished in 6.5s]
  1. Why doesn't it seem to copy after the first run (according to the timing)?
  2. Am I using free appropriately? (I tried moving it into the loop, but then it doesn't run.)
  3. Am I passing the buffer appropriately to the function copy?

Thanks!

EDIT1: Thank you for all your comments. I have corrected the major flaws: opening and closing the files within the loop, using the buffer appropriately, and the types of the variables as suggested. I'm getting results that are much more logical:

0, buffer size = 1, time : 8069679
1, buffer size = 2, time : 4082421
2, buffer size = 4, time : 2041673
3, buffer size = 8, time : 1020645
4, buffer size = 16, time : 514176
...

but I'm still struggling with handling write() errors appropriately.

Edit2: is this version of copy fine?

void copy (int in, int out, char *buffer, size_t taille) {
  ssize_t t;

  while ((t = read(in, buffer, taille))> 0){
    if (write (out, buffer, t)<0){
      perror("error writing");
    }
  }

  if (t < 0)
    perror("read");
}

Why doesn't it seem to copy after the first run? (according to the timing?)

Lots of possibilities. Firstly, there are problems with your code. You don't seem to be rewinding or reopening the file to copy. After the first iteration, you are at end of file, so the remaining iterations copy 0 bytes.
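To make the end-of-file effect concrete, here is a minimal sketch. The helper names `count_bytes` and `rewind_fd` are made up for illustration and are not from the question. Draining the same descriptor twice yields 0 bytes the second time unless the file offset is reset with lseek() in between.

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read from fd until end of file (or an error)
 * and return how many bytes were seen. */
long count_bytes(int fd)
{
    char buf[4096];
    long total = 0;
    ssize_t n;

    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += n;
    return total;
}

/* Hypothetical helper: move the file offset back to the start.
 * Returns 0 on success, -1 on failure (for example if fd is a pipe). */
int rewind_fd(int fd)
{
    return lseek(fd, 0, SEEK_SET) == (off_t)-1 ? -1 : 0;
}
```

Calling `count_bytes(fd)` twice in a row returns the file size and then 0. Calling `rewind_fd(fd)` between the two calls makes the second one return the file size again, which is what each timed iteration needs.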

Secondly, there are OS factors to consider. In particular, general-purpose operating systems maintain an in-memory cache of recently used disk contents. This means that the first time you read a file, it has to be pulled off disk, but on subsequent occasions it may already be in RAM.

Am I using free appropriately? (I tried moving it in the loop, but it doesn't run)

Yes. realloc will either reuse the same memory block if it is big enough, or it will malloc a new block, copy the old block's contents and free the old block. So do not ever attempt to realloc a block you have already freed.
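The temporary-pointer pattern from the question can be wrapped up as below. This is only a sketch: `grow_buffer` is a hypothetical name, and it assumes `new_size` is greater than zero.

```c
#include <stdlib.h>

/* Hypothetical helper: resize *buf to new_size bytes (new_size > 0).
 * On success, *buf points at the (possibly moved) block and 0 is returned.
 * On failure, the old block is untouched and still owned by the caller,
 * and -1 is returned. */
int grow_buffer(char **buf, size_t new_size)
{
    char *tmp = realloc(*buf, new_size);

    if (tmp == NULL)
        return -1;      /* *buf is still valid and must still be freed */
    *buf = tmp;         /* the old pointer value must not be used again */
    return 0;
}
```

With this shape there is exactly one live pointer at any time, so a single free(buffer) after the loop is enough. Freeing inside the loop would hand realloc an already-freed block on the next iteration.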

Am I passing the buffer appropriately to the function copy?

Yes, but you are not using it appropriately within the function copy(), as detailed by the comments you are receiving. Some of the problems within copy() are:

  • buffer is already a char *, so do not take its address to pass to read().
  • taille is the length of buffer, so pass it directly to read. Passing sizeof taille passes the size of the variable itself, not its content.
  • write need not necessarily write all the bytes in the buffer in one go. In that case, it will return a short count (unlikely to be an issue for a disk file).
  • write can also return -1 for an error. You need to handle that error.
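The last two points can be handled together with a small loop around write(). This is a sketch rather than a definitive implementation: `write_all` is a hypothetical name, and retrying on EINTR is a common convention that I am assuming is wanted here.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all len bytes have
 * been written. Returns 0 on success, -1 on error (errno set by write). */
int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0)
    {
        ssize_t n = write(fd, buf, len);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted before writing: retry */
            return -1;          /* real error, e.g. disk full or bad fd */
        }
        buf += n;               /* skip the part already written */
        len -= (size_t)n;
    }
    return 0;
}
```

Inside copy(), the call would look like `if (write_all(out, buffer, (size_t)t) != 0) { perror("write"); return; }`, so a failed copy stops instead of silently continuing.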

In your main program there are also issues.

  • As stated above: you either need to close and reopen the input file or rewind it to the beginning on each iteration of the loop.
  • remove does not do what you think. It merely removes the directory entry and decrements the file's reference count. The file only physically goes away when its reference count reaches zero, and it won't reach zero while you still have an open file descriptor to it. So you also need to close and reopen the output file, or you'll just continue appending to an anonymous file that will be automatically deleted when your process exits.
  • One I didn't spot before: you should declare taille and buffer_size as size_t because that is the right-sized type for the arguments to realloc, read (and write). t should, however, be an ssize_t (signed size) because it holds the return value of read, which is either -1 or the count of bytes read.
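The remove() behaviour can be observed directly. In this sketch, `link_count` and `write_after_remove` are hypothetical names. It creates a file, removes its name while the descriptor is still open, and then writes through the descriptor anyway.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: number of directory entries (hard links) that
 * still refer to the file behind fd, or -1 if fstat() fails. */
long link_count(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    return (long)st.st_nlink;
}

/* Hypothetical demo: create path, remove its name while the descriptor
 * is still open, then write through the descriptor.
 * Returns the link count seen after remove(), or -1 on any failure. */
long write_after_remove(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    long links = -1;

    if (fd < 0)
        return -1;
    if (remove(path) == 0 && write(fd, "data", 4) == 4)
        links = link_count(fd);   /* the name is gone, the file is not */

    close(fd);                    /* only now can the storage be reclaimed */
    return links;
}
```

On a typical local POSIX filesystem the write succeeds and the function returns 0: no name refers to the file, yet it remains usable until close(). That is why the output file has to be closed and reopened on each iteration.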

Here's my modified version of your code, addressing most of the issues that I raised in comments, and most of those that other people raised.

# include <stdio.h>
# include <stdlib.h>
# include <unistd.h>
# include <fcntl.h>
# include <time.h>

size_t copy(int in, int out, char *buffer, size_t taille);

size_t copy(int in, int out, char *buffer, size_t taille)
{
    ssize_t t;
    ssize_t bytes = 0;

    while ((t = read(in, buffer, taille)) > 0)
    {
        if (write(out, buffer, t) != t)
            return 0;
        bytes += t;
    }

    if (t < 0)
        perror("read");
    return bytes;
}

int main(void)
{
    clock_t timing;
    int buffer_size = 1;
    char *buffer = malloc(sizeof(char) * buffer_size);

    if (!buffer)
    {
        perror("malloc ini");
        exit(1);
    }

    int fichier1 = open("grosfichier", O_RDONLY);
    if (fichier1 < 0)
    {
        perror("grosfichier");
        exit(1);
    }

    for (int i = 0; buffer_size <= 1048576; i++)
    {
        lseek(fichier1, 0L, SEEK_SET);
        char *temp_buffer = realloc(buffer, buffer_size * sizeof(char));
        if (!temp_buffer)
        {
            perror("malloc temp_buffer");
            exit(1);
        }
        int fichier2 = open("grosfichier_copy", O_WRONLY | O_CREAT, 0644);
        if (fichier2 < 0)
        {
            perror("open copy file");
            exit(1);
        }

        buffer = temp_buffer;

        timing = clock();
        size_t copied = copy(fichier1, fichier2, buffer, buffer_size);
        timing = clock() - timing;

        printf("%d, buffer size = %9d, time : %8ld (copied %zu bytes)\n",
               i, buffer_size, timing, copied);
        close(fichier2);
        remove("grosfichier_copy");

        buffer_size *= 2;
    }
    free(buffer);
    close(fichier1);

    return 0;
}

When I ran it (with two timing commands giving times), I got:

2018-01-15 08:00:27 [PID 43372] copy43
0, buffer size =         1, time : 278480098 (copied 50000000 bytes)
1, buffer size =         2, time : 106462932 (copied 50000000 bytes)
2, buffer size =         4, time : 53933508 (copied 50000000 bytes)
3, buffer size =         8, time : 27316467 (copied 50000000 bytes)
4, buffer size =        16, time : 13451731 (copied 50000000 bytes)
5, buffer size =        32, time :  6697516 (copied 50000000 bytes)
6, buffer size =        64, time :  3459170 (copied 50000000 bytes)
7, buffer size =       128, time :  1683163 (copied 50000000 bytes)
8, buffer size =       256, time :   882365 (copied 50000000 bytes)
9, buffer size =       512, time :   457335 (copied 50000000 bytes)
10, buffer size =      1024, time :   240605 (copied 50000000 bytes)
11, buffer size =      2048, time :   126771 (copied 50000000 bytes)
12, buffer size =      4096, time :    70834 (copied 50000000 bytes)
13, buffer size =      8192, time :    46279 (copied 50000000 bytes)
14, buffer size =     16384, time :    35227 (copied 50000000 bytes)
15, buffer size =     32768, time :    27996 (copied 50000000 bytes)
16, buffer size =     65536, time :    28486 (copied 50000000 bytes)
17, buffer size =    131072, time :    24203 (copied 50000000 bytes)
18, buffer size =    262144, time :    26015 (copied 50000000 bytes)
19, buffer size =    524288, time :    19484 (copied 50000000 bytes)
20, buffer size =   1048576, time :    28851 (copied 50000000 bytes)
2018-01-15 08:08:47 [PID 43372; status 0x0000]  -  8m 19s

real    8m19.351s
user    1m21.231s
sys 6m52.312s

As you can see, the 1-byte copying was dramatically awful and took something like 4 minutes of wall clock time to copy the data. Using 2 bytes halved that; 4 bytes halved it again, and the improvements kept going until about 32 KiB. After that, the performance was steady, and fast (the last few lines appeared in what seemed like under a second each, but I wasn't paying close attention). I'd put in alternative wall-clock timing using clock_gettime() (or gettimeofday() if that's not available) to time each cycle. I was worried at first by the lack of progress on the single-byte copying, but a second terminal window confirmed the copy was growing, but oh so slowly!
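A minimal sketch of that wall-clock timing follows, with the assumption that CLOCK_MONOTONIC is available on the target system. The helper names `wall_seconds` and `cpu_seconds` are made up for this example. clock() reports processor time consumed by the process, whereas clock_gettime() on the monotonic clock reports elapsed real time, including time spent blocked on I/O.

```c
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() under strict modes */
#include <time.h>

/* Hypothetical helper: current reading of the monotonic clock in seconds.
 * Returns a negative value if clock_gettime() fails. */
double wall_seconds(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
        return -1.0;
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Hypothetical helper: processor time used so far, converted from
 * clock() ticks to seconds. */
double cpu_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}
```

In the loop, one would record `double start = wall_seconds();` before the call to copy() and print `wall_seconds() - start` afterwards, alongside the existing clock() difference, so the two kinds of time can be compared per buffer size.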

It's been a while since this thread was active, but I thought I'd add to Andrew Henle's post.

To get a better idea of the real time involved in copying files, one could add an fsync(2) after the forever-loop exits and before copy() returns. fsync(2) will make sure all the data in the system's buffers has been sent to the underlying storage device. Note, however, that most disk drives have an onboard cache that can buffer writes, again masking the actual time it takes to write to the media.
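As a rough sketch of where the fsync() call would sit, here is a variant of the copy loop that also checks fsync()'s return value. It uses a plain while loop rather than Andrew's for (;;). The name `copy_and_sync` and the choice to treat a short write as a failure are my own simplifications.

```c
#include <sys/types.h>
#include <unistd.h>

/* Sketch: copy everything from in to out, then flush out to the device.
 * Returns the number of bytes copied, or -1 on any read, write, or
 * fsync failure. A short write is treated as a failure for simplicity. */
long long copy_and_sync(int in, int out, char *buffer, size_t bufsize)
{
    long long total = 0;
    ssize_t n;

    while ((n = read(in, buffer, bufsize)) > 0)
    {
        if (write(out, buffer, (size_t)n) != n)
            return -1;
        total += n;
    }
    if (n < 0)
        return -1;

    /* Ask the kernel to push its cached data for out to storage, so the
     * caller's timing includes that work. */
    if (fsync(out) != 0)
        return -1;
    return total;
}
```

Because the flush happens before the function returns, a timer wrapped around the call covers the write-back as well as the read and write system calls.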

The vast majority of code that I write is for safety-critical systems. Those are systems that, if they malfunction, can cause serious injury or death, or serious environmental damage. Such systems can be found in modern aircraft, nuclear power plants, medical devices, and automobile computers, just to name a few.

One of the rules applying to source code for safety-critical systems is that loops must have a clear condition to break out of the loop. By "clear", I mean the break condition must be expressed in the for, while, or do-while itself, and not somewhere within the compound statement.

I understand exactly what Andrew wrote. The intent is clear. It's concise. There's nothing wrong with it. And it's an excellent suggestion.

But (here's the "but"), the condition in the for appears at first glance to be infinite:

for (;;) { ... }

Why is this important? Source code validators would flag this as an infinite loop. Then you get dinged on your performance review, you don't get the raise you were expecting, your wife gets mad at you, files for a divorce, takes everything you own, and takes off with your divorce lawyer. And THAT's why it's important.

I'd like to suggest an alternate structure:

void copy( int in, int out, char *buffer, size_t bufsize )
{
    ssize_t bytes_read;

    switch(1)
    do
    {
        ssize_t bytes_written;

        bytes_written = write( out, buffer, bytes_read );
        if ( bytes_written != bytes_read )
        {
            // error handling code
        }

    default:    // Loop entry point is here
        bytes_read = read( in, buffer, bufsize );
    } while ( bytes_read > 0 );

    fsync( out );
}

I first ran across a switch-loop structure like this in the mid-1980s. It was an effort to optimize the use of a pipelined architecture by avoiding departures from the execution of sequential instructions.

Suppose you had a simple routine that had to do a few things a great number of times. Copying data from one buffer to another is a perfect example.

char *srcp, *dstp;   // source and destination pointers
int count;           // number of bytes to copy (must be > 0)
...
while (count--)
{
    *dstp++ = *srcp++;
}
...

Simple enough. Right?

Downside: on every iteration, the processor has to jump back to the start of the loop, and in doing so, it dumps whatever is in the prefetch pipeline.

Using a technique called "loop unrolling", this can be rewritten to take advantage of a pipeline:

char *srcp, *dstp;   // source and destination pointers
int count;           // number of bytes to copy (must be > 0)
...
switch (count % 8)
do
{
    case 0: *dstp++ = *srcp++; --count;
    case 7: *dstp++ = *srcp++; --count;
    case 6: *dstp++ = *srcp++; --count;
    case 5: *dstp++ = *srcp++; --count;
    case 4: *dstp++ = *srcp++; --count;
    case 3: *dstp++ = *srcp++; --count;
    case 2: *dstp++ = *srcp++; --count;
    case 1: *dstp++ = *srcp++; --count;
} while (count > 0);
...

Follow it through. The first statement executed is the switch. It takes the low three bits of count and jumps to the appropriate case label. Each case copies the data, increments the pointers, and decrements the count, then falls through to the next case.

When it gets to the bottom, the while condition is evaluated and, if true, execution continues at the top of the do..while. It does not re-execute the switch.

The advantage is that the machine code produced is a longer series of sequential instructions, and therefore executes fewer jumps, taking greater advantage of a pipelined architecture.

As noted in the comments, this code is wrong:

void copy (int in, int out, char *buffer, long long taille) {
  int t;

  while ((t = read(in, &buffer, sizeof taille))> 0)
    write (out, &buffer, t);


  if (t < 0)
    perror("read");
}

First, a minor issue: both read() and write() return ssize_t, not int.

Second, you're ignoring the return value from write(), so you never really know how much gets written. This may or may not be a problem in your code, but you won't detect a failed copy from a filled-up filesystem, for example.

Now, for the real problems.

read(in, &buffer, sizeof taille)

&buffer is wrong. buffer is a char *, a variable in memory containing the address of a char buffer. Passing &buffer tells read() to put the data it reads from the in file descriptor into the memory occupied by the buffer pointer variable itself, and not into the actual memory that the address held in the buffer pointer variable refers to. You simply need buffer.

sizeof taille is also wrong. That's the size of the taille variable itself; as a long long it's likely 8 bytes.
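The difference between the size of a variable and the value it holds can be checked with two tiny functions. Their names are invented for this illustration.

```c
#include <stddef.h>

/* What the original call passed to read(): the size of the variable. */
size_t size_of_variable(long long taille)
{
    return sizeof taille;      /* same result whatever value taille holds */
}

/* What it should have passed: the value stored in the variable. */
size_t value_of_variable(long long taille)
{
    return (size_t)taille;
}
```

`size_of_variable(1)` and `size_of_variable(1048576)` return the same number, `sizeof(long long)`, while `value_of_variable(1048576)` returns 1048576. On a typical 64-bit platform a char * also occupies 8 bytes, which is probably why reading 8 bytes into &buffer happened to copy the file without crashing.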

If you're trying to copy the entirety of a file:

void copy( int in, int out, char *buffer, size_t bufsize )
{
    // why stuff three or four operations into
    // the conditional part of a while()??
    for ( ;; )
    {
        ssize_t bytes_read = read( in, buffer, bufsize );
        if ( bytes_read <= 0 )
        {
            break;
        }

        ssize_t bytes_written = write( out, buffer, bytes_read );
        if ( bytes_written != bytes_read )
        {
            // error handling code
        }
    }
 }

It's that simple. The hard part is the error handling for any possible failure.
