
Buffered reading from stdin using fread in C

I am trying to read efficiently from stdin by using setvbuf in `_IOFBF` mode. I am new to buffering, and I am looking for working examples.

The input begins with two integers (n, k). Each of the next n lines of input contains one integer. The aim is to print how many of those integers are divisible by k.

#define BUFSIZE 32
int main(){
  int n, k, tmp, ans=0, i, j;
  char buf[BUFSIZE+1] = {'0'};
  setvbuf(stdin, (char*)NULL, _IONBF, 0);
  scanf("%d%d\n", &n, &k);
  while(n>0 && fread(buf, (size_t)1, (size_t)BUFSIZE, stdin)){
    i=0; j=0;
    while(n>0 && sscanf(buf+j, "%d%n", &tmp, &i)){
    //printf("tmp %d - scan %d\n",tmp,i); //for debugging
      if(tmp%k==0)  ++ans;
      j += i; //increment the position where sscanf should read from
      --n;
    }
  }
  printf("%d", ans);
  return 0;
}

The problem is that if a number sits at the buffer boundary, buf will read 23 from 2354\n, when it should have either read 2354 (which it cannot) or nothing at all.

How can I solve this issue?


Edit
Resolved now (with analysis).

Edit
Complete Problem Specification

I am going to recommend trying full buffering with setvbuf and ditching fread. If the specification is that there is one number per line, I will take that for granted, use fgets to read in a full line, and pass it to strtoul to parse the number that is supposed to be on that line.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define INITIAL_BUFFER_SIZE 2 /* for testing */

int main(void) {
    int n;
    int divisor;
    int answer = 0;
    int current_buffer_size = INITIAL_BUFFER_SIZE;
    char *line = malloc(current_buffer_size);

    if ( line == NULL ) {
        return EXIT_FAILURE;
    }

    setvbuf(stdin, (char*)NULL, _IOFBF, 0);

    scanf("%d%d\n", &n, &divisor);

    while ( n > 0 ) {
        unsigned long dividend;
        char *endp;
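        int offset = 0;
        line[0] = '\0'; /* so a failed read never re-parses the previous line */
        /* read into the free space after what we already hold */
        while ( fgets(line + offset, current_buffer_size - offset, stdin) ) {
            offset += (int)strlen(line + offset);
            if ( offset > 0 && line[offset - 1] == '\n' ) {
                break;
            }
            else {
                /* partial line: double the buffer and keep reading */
                int new_buffer_size = 2 * current_buffer_size;
                char *tmp = realloc(line, new_buffer_size);
                if ( tmp ) {
                    line = tmp;
                    current_buffer_size = new_buffer_size;
                }
                else {
                    break;
                }
            }
        }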
        errno = 0;
        dividend = strtoul(line, &endp, 10);
        if ( !( (endp == line) || errno ) ) {
            if ( dividend % divisor == 0 ) {
                answer += 1;
            }
        }
        n -= 1;
    }

    printf("%d\n", answer);
    return 0;
}

I used a Perl script to generate 1,000,000 random integers between 0 and 1,000,000 and checked whether they were divisible by 5, after compiling this program with gcc version 3.4.5 (mingw-vista special r3) on my Windows XP laptop. The whole thing took less than 0.8 seconds.

When I turned buffering off using setvbuf(stdin, (char*)NULL, _IONBF, 0); the time went up to about 15 seconds.

One thing that I find confusing is why you are both enabling full buffering within the stream object via the call to setvbuf and doing your own buffering by reading a full buffer into buf.

I understand the need to do buffering, but that is a bit overkill.

I'm going to recommend you stick with setvbuf and remove your own buffering. The reason is that implementing your own buffering can be tricky. The problem is what will happen when a token (in your case a number) straddles the buffer boundary. For example, let's say your buffer is 8 bytes (9 bytes total with the trailing NUL) and your input stream looks like

12345 12345

The first time you fill the buffer you get:

"12345 12"

while the second time you fill the buffer you get:

"345"

Proper buffering requires you to handle that case so you treat the input as the two numbers {12345, 12345} and not three numbers {12345, 12, 345}.

Since stdio already handles that for you, just use that. Continue to call setvbuf, get rid of the fread, and use scanf to read individual numbers from the input stream.
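As a rough sketch of that advice, the function below leaves all buffering to stdio and reads one number per call. The name count_divisible and its FILE * parameter are assumptions made for illustration; they do not come from the question.

```c
#include <stdio.h>

/* Hypothetical helper (name and signature are assumptions for illustration).
 * Reads "n k" and then up to n integers from `in`; returns how many of them
 * are divisible by k, or -1 if the header is malformed. */
long count_divisible(FILE *in)
{
    long n, k, value, answer = 0;

    if (fscanf(in, "%ld%ld", &n, &k) != 2 || k == 0)
        return -1;

    /* stdio refills its own buffer as needed, so the caller never sees a
     * number that was split across two reads. */
    while (n-- > 0 && fscanf(in, "%ld", &value) == 1) {
        if (value % k == 0)
            answer++;
    }
    return answer;
}
```

For stdin, a caller would typically run setvbuf(stdin, NULL, _IOFBF, BUFSIZ) before any other read (the buffer size is an arbitrary choice here) and then call count_divisible(stdin).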

Version 1: Using getchar_unlocked as suggested by R Samuel Klatchko (see comments)

#include <stdio.h>

int main(){
  int lines, number=0, dividend, ans=0;
  int c; // int rather than char, so EOF can be told apart from data
  setvbuf(stdin, (char*)NULL, _IOFBF, 0);// full buffering mode
  scanf("%d%d\n", &lines, &dividend);
  while(lines>0){
    c = getchar_unlocked();
    if(c==EOF)  break; // truncated input: stop instead of looping forever
    //parse the number using characters
    //each number is on a separate line
    if(c=='\n'){
      if(number % dividend == 0)    ans += 1;
      lines -= 1;
      number = 0;
    }
    else
      number = c - '0' + 10*number;
  }

  printf("%d are divisible by %d \n", ans, dividend);
  return 0;
}

Version 2: Using fread to read a block and parsing numbers from it.

#include <stdio.h>

#define BUFSIZE (32*1024)

int main(){
  int lines, number=0, dividend, ans=0, i, chars_read;
  char buf[BUFSIZE+1] = {0}; //initialise all elements to 0
  scanf("%d%d\n",&lines, &dividend);

  while(lines>0 && (chars_read = fread(buf, 1, BUFSIZE, stdin)) > 0){
    //read the chars from buf; number persists across blocks, so a
    //value split over two reads is still assembled correctly
    for(i=0; i < chars_read && lines > 0; i++){
      //parse the number using characters
      //each number is on a separate line
      if(buf[i] != '\n')
        number = buf[i] - '0' + 10*number;
      else{
        if(number%dividend==0)    ans += 1;
        lines -= 1;
        number = 0;
      }
    }
  }

  printf("%d are divisible by %d \n", ans, dividend);
  return 0;
}

Results: (10 million numbers tested for divisibility by 11)

Run 1: (Version 1 without setvbuf) 0.782 secs
Run 2: (Version 1 with setvbuf) 0.684 secs
Run 3: (Version 2) 0.534 secs

PS: Every run was compiled with GCC using the -O1 flag.

The problem when you are not using redirection is that you are not causing EOF.

Since this appears to be POSIX (based on the fact you are using gcc), just type ctrl-D (i.e. while holding the control key, press and release d), which will cause EOF to be reached.

If you are using Windows, I believe you use ctrl-Z instead.

If you are after out-and-out speed and you work on a POSIX-ish platform, consider using memory mapping. I took Sinan's answer using standard I/O and timed it, and also created the program below using memory mapping. Note that memory mapping will not work if the data source is a terminal or a pipe and not a file.

With one million values between 0 and one billion (and a fixed divisor of 17), the average timings for the two programs were:

  • standard I/O: 0.155s
  • memory mapped: 0.086s

Roughly, memory mapped I/O is twice as fast as standard I/O.

In each case, the timing was repeated 6 times, after ignoring a warm-up run. The command lines were:

time fbf < data.file    # Standard I/O (full buffering)
time mmf < data.file    # Memory mapped file I/O

#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>     /* strerror() */
#include <sys/mman.h>
#include <sys/stat.h>

static const char *arg0 = "**unset**";
static void error(const char *fmt, ...)
{
    va_list args;
    fprintf(stderr, "%s: ", arg0);
    va_start(args, fmt);
    vfprintf(stderr, fmt, args);
    va_end(args);
    exit(EXIT_FAILURE);
}

static unsigned long read_integer(char *src, char **end)
{
    unsigned long v;
    errno = 0;
    v = strtoul(src, end, 0);
    if (v == ULONG_MAX && errno == ERANGE)
        error("integer too big for unsigned long at %.20s", src);
    if (v == 0 && errno == EINVAL)
        error("failed to convert integer at %.20s", src);
    if (**end != '\0' && !isspace((unsigned char)**end))
        error("dubious conversion at %.20s", src);
    return(v);
}

static void *memory_map(int fd)
{
    void *data;
    struct stat sb;
    if (fstat(fd, &sb) != 0)
        error("failed to fstat file descriptor %d (%d: %s)\n",
              fd, errno, strerror(errno));
    if (!S_ISREG(sb.st_mode))
        error("file descriptor %d is not a regular file (%o)\n", fd, sb.st_mode);
    data = mmap(0, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED)
        error("failed to memory map file descriptor %d (%d: %s)\n",
              fd, errno, strerror(errno));
    return(data);
}

int main(int argc, char **argv)
{
    char *data;
    char *src;
    char *end;
    unsigned long k;
    unsigned long n;
    unsigned long answer = 0;
    size_t i;

    arg0 = argv[0];
    data = memory_map(0);

    src = data;

    /* Read control data */
    n = read_integer(src, &end);
    src = end;
    k = read_integer(src, &end);
    src = end;

    for (i = 0; i < n; i++, src = end)
    {
        unsigned long v = read_integer(src, &end);
        if (v % k == 0)
            answer++;
    }

    printf("%lu\n", answer);
    return(0);
}

You can use the value of n to stop reading the input after you've seen n integers.

Change the condition of the outer while loop to:

while(n > 0 && fread(buf, 1, BUFSIZE, stdin))

and change the body of the inner one to:

{
  n--;
  if(tmp%k == 0)  ++ans;
}

The problem you're continuing to have is that because you never adjust buf in the inner while loop, sscanf keeps reading the same number over and over again.

If you switch to using strtol() instead of sscanf(), then you can use the endptr output parameter to move through the buffer as numbers are read.
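A minimal sketch of walking a buffer with strtol() and its endptr parameter might look like the following. The function name count_divisible_in is a hypothetical helper, and the sketch assumes the buffer is NUL-terminated and that k is nonzero.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper (an assumption for illustration): counts the integers
 * in the NUL-terminated string `buf` that are divisible by `k` (k != 0). */
int count_divisible_in(const char *buf, long k)
{
    const char *p = buf;
    char *end;
    int count = 0;

    for (;;) {
        errno = 0;
        long v = strtol(p, &end, 10);   /* skips leading whitespace itself */
        if (end == p)                   /* nothing converted: no digits left */
            break;
        if (errno != ERANGE && v % k == 0)
            count++;
        p = end;                        /* resume right after this number */
    }
    return count;
}
```

This only advances within one buffer. A number cut off at the end of an fread block would still have to be carried over to the next block by the caller.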

Well, right off the top: check the return value of scanf(), which tells you how many variables it filled. scanf("%d%d",&n,&k) does fill both n and k, because %d skips leading whitespace, so the space in scanf("%d %d",&n,&k) is optional; but without the check, a malformed first line silently leaves k unset.
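Checking the return value could be sketched like this, using sscanf on a string so the behaviour is easy to try out. The helper name parse_header is an assumption made for illustration.

```c
#include <stdio.h>

/* Hypothetical helper (an assumption for illustration): parses the "n k"
 * header from `text`. Returns 1 only if both values were converted. */
int parse_header(const char *text, int *n, int *k)
{
    /* sscanf returns the number of successful conversions; %d skips any
     * leading whitespace, including newlines, before each number. */
    return sscanf(text, "%d%d", n, k) == 2;
}
```

The same comparison against the expected count applies to scanf() on stdin.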

Second, n is the number of iterations to run, but you test for "n>0" yet never decrement it. Ergo, n>0 is always true and the loop won't exit.

As someone else mentioned, feeding stdin over a pipe causes the loop to exit because the end of stdin has an EOF, which causes fread() to return 0, exiting the loop. You probably want to add an "n=n-1" or "n--" somewhere in there.

Next, in your sscanf, %n is standard C: it stores the number of characters consumed so far and does not count toward the return value. The trap is the loop condition: sscanf() returns EOF, which is nonzero, once the buffer is exhausted, so compare the result against 1 instead of testing it for truth.
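To illustrate how %n advances through a buffer, here is a small sketch. The function sum_ints is a hypothetical helper introduced only for this example; it sums the integers in a string rather than testing divisibility.

```c
#include <stdio.h>

/* Hypothetical helper (an assumption for illustration): sums the integers
 * found in the NUL-terminated string `buf`. */
long sum_ints(const char *buf)
{
    int value, used;
    int pos = 0;
    long sum = 0;

    /* Compare against 1: at the end of the string sscanf returns EOF,
     * which is nonzero and would keep a plain truth test looping. */
    while (sscanf(buf + pos, "%d%n", &value, &used) == 1) {
        sum += value;
        pos += used;    /* %n stored how many characters this call consumed */
    }
    return sum;
}
```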

Finally, if performance is important, you'd be better off not using fread() etc. at all, as they're not really high performance. Look at isdigit(3) and iscntrl(3) and think about how you could parse the numbers from a raw data buffer read with read(2).
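One way that idea might look is sketched below, for a POSIX system. The names struct tally, take_number and count_divisible_fd are assumptions made for illustration, and the sketch assumes the input holds only non-negative decimal integers separated by whitespace, with n and k first.

```c
#include <ctype.h>
#include <unistd.h>

/* Hypothetical running state: header values plus counters. */
struct tally { unsigned long n, k, seen, answer; };

/* Called once per completed number: the first two are n and k, the rest
 * are data values (only the first n of them are counted). */
static void take_number(struct tally *t, unsigned long v)
{
    if (t->seen == 0)
        t->n = v;
    else if (t->seen == 1)
        t->k = v;
    else if (t->seen - 2 < t->n && t->k != 0 && v % t->k == 0)
        t->answer++;
    t->seen++;
}

/* Hypothetical helper: reads everything from `fd` with read(2) and returns
 * the count of data values divisible by k, or -1 on a read error. */
long count_divisible_fd(int fd)
{
    char buf[1 << 16];
    struct tally t = {0, 0, 0, 0};
    unsigned long current = 0;  /* survives across read() calls */
    int in_number = 0;
    ssize_t got;

    while ((got = read(fd, buf, sizeof buf)) > 0) {
        for (ssize_t i = 0; i < got; i++) {
            unsigned char c = (unsigned char)buf[i];
            if (isdigit(c)) {
                current = current * 10 + (unsigned long)(c - '0');
                in_number = 1;
            } else if (in_number) {
                take_number(&t, current);
                current = 0;
                in_number = 0;
            }
        }
    }
    if (got < 0)
        return -1;
    if (in_number)              /* last number had no trailing newline */
        take_number(&t, current);
    return (long)t.answer;
}
```

Because current and in_number live outside the read loop, a number that straddles two read() calls is assembled correctly. Whether this actually beats stdio would need measuring on the target system.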

The outermost while() loop will only exit when the read from stdin returns EOF. This can only happen when reaching the actual end-of-file on an input file, or if the process writing to an input pipe exits. Hence the printf() statement is never executed. I don't think this has anything to do with the call to setvbuf().

Maybe also take a look at this getline implementation:

http://www.cpax.org.uk/prg/portable/c/libs/sosman/index.php

(An ISO C routine for getting a line of data, length unknown, from a stream.)

The reason all this premature optimisation has a negligible effect on the runtime is that in *nix and Windows type operating systems the OS handles all I/O to and from the file system and implements 30 years' worth of research, trickery and deviousness to do this very efficiently.

The buffering you are trying to control is merely the block of memory used by your program. So any increases in speed will be minimal (the effect of doing 1 large 'mov' versus 6 or 7 smaller 'mov' instructions).

If you really want to speed this up, try "mmap", which allows you to directly access the data in the file system's buffer.

Here's my byte-by-byte take on it:

/*

Buffered reading from stdin using fread in C,
http://stackoverflow.com/questions/2371292/buffered-reading-from-stdin-for-performance

compile with:
gcc -Wall -O3  fread-stdin.c

create numbers.txt:
echo 1000000 5 > numbers.txt
jot -r 1000000 1 1000000 $RANDOM >> numbers.txt

time -p cat numbers.txt | ./a.out

*/

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

#define BUFSIZE 32

int main() {

   int n, k, tmp, ans=0, i=0, countNL=0;
   char *endp = 0;

   setvbuf(stdin, (char*)NULL, _IOFBF, 0);       // turn buffering mode on
   //setvbuf(stdin, (char*)NULL, _IONBF, 0);     // turn buffering mode off

   scanf("%d%d\n", &n, &k);

   char singlechar = 0;
   char intbuf[BUFSIZE + 1] = {0};

   while(fread(&singlechar, 1, 1, stdin))     // fread byte-by-byte
   {
      if (singlechar == '\n') 
      {
         countNL++;
         intbuf[i] = '\0';
         tmp = strtoul(intbuf, &endp, 10);
         if( tmp % k == 0) ++ans;
         i = 0;
      } else {
         if (i < BUFSIZE) {        // guard against lines longer than intbuf
            intbuf[i] = singlechar;
            i++;
         }
      }
      if (countNL == n) break;
   }

   printf("%d integers are divisible by %d.\n", ans, k);
   return 0;

}
