
Read a file either with mmap or fscanf in C

I'm asking for both advice and an opinion.

I have a file made up of pairs of integers, for example:

1 2
1 3
4 7
2 5
3 10 

Now, I want to read it, but every method I can think of has its own problems.

With mmap(), I get the file as a buffer of characters, but extracting the numbers from it seems very painful: I don't know their lengths, so atoi() or the classic Number-'0' trick doesn't seem to be enough.

On the other hand, fscanf gives me the numbers directly as integers, but I always have trouble with its termination. How do you know when it has finished reading? Does it return '\0', EOF, or something else? In my experience it seems to behave randomly. A function that counts the number of lines in a file could be useful, but does one even exist?

Now, over to you. Which method would you prefer to use? And how would you fix the problems above?

You can use strtol to convert the numbers, whether they come from your mmap()ped buffer or from a line you've read via traditional I/O. I find it very convenient (in fact, even more convenient than fscanf under most circumstances). If you want to find the next newline in your buffer, using memchr carefully(!) is a very efficient way of doing so. It can then give you the next pointer to pass to strtol.
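As a rough illustration of the strtol plus memchr approach, here is a minimal sketch. The helper name parse_pairs, the 64-byte line limit, and the choice to copy each line into a small local buffer are all assumptions made for this example. The copy is one way to be "careful": strtol expects a NUL-terminated string, and an mmap()ped file is not guaranteed to have one.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse "a b" lines from buf[0..len), which need not be
   NUL-terminated (as with an mmap()ed file). Stores at most max_pairs pairs
   in out and returns how many were stored. Malformed lines are skipped. */
size_t parse_pairs(const char *buf, size_t len, long (*out)[2], size_t max_pairs)
{
    size_t count = 0;
    const char *p = buf, *end = buf + len;

    while (p < end && count < max_pairs) {
        /* memchr never looks past the byte count it is given. */
        const char *nl = memchr(p, '\n', (size_t)(end - p));
        size_t line_len = nl ? (size_t)(nl - p) : (size_t)(end - p);

        char line[64];                      /* assumed maximum line length */
        if (line_len < sizeof line) {
            memcpy(line, p, line_len);
            line[line_len] = '\0';          /* strtol needs a terminator */

            char *q, *r;
            long a = strtol(line, &q, 10);  /* q points just past the first number */
            long b = strtol(q, &r, 10);     /* r points just past the second number */
            if (q != line && r != q) {      /* both conversions consumed digits */
                out[count][0] = a;
                out[count][1] = b;
                count++;
            }
        }
        p = nl ? nl + 1 : end;              /* continue after the newline */
    }
    return count;
}
```

The end pointer that strtol writes back is what removes the "I don't know their lengths" problem: after the first call it already points at the start of the second number.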

If you want generality, bear in mind that not every file can be mmap()ped (e.g. pipes). Therefore, a robust program should try to mmap() the file and, if that fails, fall back to traditional I/O.
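The try-then-fall-back idea could look roughly like the following on a POSIX system. The function name load_fd and its out-parameters are invented for this sketch, and it leaves out details a production version would want (retrying read() on EINTR, for instance).

```c
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: make the contents of fd available in *data / *len.
   *mapped tells the caller whether to release with munmap() (1) or free() (0).
   Returns 0 on success, -1 on failure. */
int load_fd(int fd, char **data, size_t *len, int *mapped)
{
    struct stat st;
    if (fstat(fd, &st) == 0 && S_ISREG(st.st_mode) && st.st_size > 0) {
        void *m = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (m != MAP_FAILED) {
            *data = m;
            *len = (size_t)st.st_size;
            *mapped = 1;
            return 0;
        }
    }

    /* Fallback: pipes, terminals, empty files, or a failed mmap(). */
    size_t cap = 4096, used = 0;
    char *buf = malloc(cap);
    if (!buf) return -1;
    for (;;) {
        if (used == cap) {                  /* buffer full: double it */
            char *tmp = realloc(buf, cap *= 2);
            if (!tmp) { free(buf); return -1; }
            buf = tmp;
        }
        ssize_t n = read(fd, buf + used, cap - used);
        if (n < 0) { free(buf); return -1; }
        if (n == 0) break;                  /* end of input */
        used += (size_t)n;
    }
    *data = buf;
    *len = used;
    *mapped = 0;
    return 0;
}
```

Either way the caller ends up with a pointer and a length, so the same parser can run over both kinds of input.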

mmap() with an ad-hoc parser is about as fast as it gets, even if it isn't very flexible. The code below will parse files of the form you gave and perhaps no other, but if the file is mechanically generated that might be OK:

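char *p, *e, *x;
int m, n;
x = mmap(...); /* e = end of buffer */
for (m = n = 0, p = x; p < e; ++p) {
  if (*p == ' ') { m = n; n = 0; }                  /* first number complete */
  else if (*p == '\n') { emit(m, n); m = n = 0; }   /* pair complete */
  else { n *= 10; n += *p - '0'; }                  /* accumulate a digit */
}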

A better file format (binary) is faster still.
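To make the binary-format remark concrete, one possible layout is a plain array of fixed-size records. The struct pair type and the two helper names below are assumptions made for this sketch, and the format is only portable between machines with the same byte order.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record layout: two 32-bit integers in host byte order. */
struct pair { int32_t a, b; };

/* Writes n records; returns the number actually written. */
size_t write_pairs_bin(FILE *fp, const struct pair *in, size_t n)
{
    return fwrite(in, sizeof in[0], n, fp);
}

/* Reads up to max records into out; returns the number actually read. */
size_t read_pairs_bin(FILE *fp, struct pair *out, size_t max)
{
    return fread(out, sizeof out[0], max, fp);
}
```

With fixed-size records there is nothing to parse, and the number of pairs is simply the file size divided by sizeof(struct pair).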


Regarding your second question, how do I know when fscanf() is at EOF: that's what feof(fp) does. You want something like:

while(!feof(fp)&&2==fscanf(fp,"%d %d\n",&m,&n))emit(m,n);

But beware: this is much slower than the above, and not much more robust. How much slower? On my mid-2012 MBA I get around 600mbps with the mmap() parser, while with fscanf I'm lucky to get 10mbps.

Using fscanf is very easy. fscanf returns the number of items successfully scanned, so in your case you can use:

while(fscanf(fp,"%d %d",&int1,&int2)==2)
{ 
// successfully scanned 2 integers
}

Where fp is a file pointer and int1 and int2 are variables of type int.
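Since the question asks what fscanf returns when it is done, here is a small sketch that checks the value that ended the loop. The helper name count_pairs is made up for the example. The standard behaviour it relies on is that fscanf returns EOF only when input runs out (or a read error occurs) before the first conversion, and otherwise returns the number of items it managed to assign.

```c
#include <stdio.h>

/* Hypothetical helper: returns the number of integer pairs in fp, or -1 if
   reading stopped on malformed input, a half-finished pair, or a read error. */
long count_pairs(FILE *fp)
{
    int a, b, rc;
    long pairs = 0;

    while ((rc = fscanf(fp, "%d %d", &a, &b)) == 2)
        pairs++;                 /* a and b hold one complete pair here */

    if (rc == EOF && !ferror(fp))
        return pairs;            /* clean end of input */
    return -1;                   /* rc was 0 or 1, or a read error occurred */
}
```

A return value of 0 or 1 means the input stopped matching partway through a pair, which is a likely source of the seemingly random behaviour when only EOF is tested for.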
