
Reading a file into a string buffer and detecting EOF

I am opening a file and placing its contents into a string buffer to do some lexical analysis on a per-character basis. Doing it this way enables parsing to finish faster than using a series of fread() calls, and since the source file will never be larger than a couple of MB, I can rest assured that the entire contents of the file will always be read.

However, there seems to be some trouble in detecting when there is no more data to be parsed, because ftell() often gives me an integer value higher than the actual number of characters within the file. This wouldn't be a problem with the use of the EOF (-1) macro, if the trailing characters were always -1... But this is not always the case...


Here's how I am opening the file, and reading it into the string buffer:

FILE *fp = NULL;
errno_t err = _wfopen_s(&fp, m_sourceFile, L"rb, ccs=UNICODE");
if(fp == NULL || err != 0) return FALSE;
if(fseek(fp, 0, SEEK_END) != 0) {
    fclose(fp);
    fp = NULL;
    return FALSE;
}

LONG fileSize = ftell(fp);
if(fileSize == -1L) {
    fclose(fp);
    fp = NULL;
    return FALSE;
}
rewind(fp);

LPSTR s = new char[fileSize];
RtlZeroMemory(s, sizeof(char) * fileSize);
DWORD dwBytesRead = 0;
if(fread(s, sizeof(char), fileSize, fp) != fileSize) {
    fclose(fp);
    fp = NULL;
    return FALSE;
}

This always appears to work perfectly fine. Following this is a simple loop, which checks the contents of the string buffer one character at a time, like so:

char c = 0;
LONG nPos = 0;
while(c != EOF && nPos <= fileSize)
{
    c = s[nPos];
    // do something with 'c' here...
    nPos++;
}

The trailing bytes of the file are usually a series of ý (-3) and « (-85) characters, and therefore EOF is never detected. Instead, the loop simply continues onward until nPos exceeds fileSize, which is not desirable for proper lexical analysis, because you often end up skipping the final token in a stream that omits a newline character at the end.


In a Basic Latin character set, would it be safe to assume that an EOF char is any character with a negative value? Or perhaps there is just a better way to go about this?


#EDIT: I have just tried to use the feof() function in my loop, and all the same, it doesn't seem to detect EOF either.

Assembling comments into an answer...

  • You leak memory (potentially a lot of memory) when you fail to read.

  • You haven't allowed for a null terminator at the end of the string read.

  • There's no point in zeroing the memory when it is all about to be overwritten by the data from the file.

  • Your test loop is accessing memory out of bounds; nPos == fileSize is one beyond the end of the memory you allocated.

     char c = 0;
     LONG nPos = 0;
     while(c != EOF && nPos <= fileSize)
     {
         c = s[nPos];
         // do something with 'c' here...
         nPos++;
     }

  • There are other problems, not previously mentioned, with this. You did ask if it is 'safe to assume that an EOF char is any character with a negative value', to which I responded No. There are several issues here that affect both C and C++ code. The first is that plain char may be a signed type or an unsigned type. If the type is unsigned, then you can never store a negative value in it (or, more accurately, if you attempt to store a negative integer into an unsigned char, it will be truncated to the least significant 8* bits and will be treated as positive).

  • In the loop above, one of two problems can occur. If char is a signed type, then there is a character (ÿ, y-umlaut, U+00FF, LATIN SMALL LETTER Y WITH DIAERESIS, 0xFF in the Latin-1 code set) that has the same value as EOF (which is always negative and usually -1). Thus, you might detect EOF prematurely. If char is an unsigned type, then there will never be any character equal to EOF. But the test for EOF on a character string is fundamentally flawed; EOF is a status indicator from I/O operations and not a character.

  • During I/O operations, you will only detect EOF when you've attempted to read data that isn't there. The fread() won't report EOF; you asked to read exactly what was in the file. If you tried getc(fp) after the fread(), you'd get EOF unless the file had grown since you measured how long it is. Since _wfopen_s() is a non-standard function, it might be affecting how ftell() behaves and the value it reports. (But you later established that wasn't the case.)

  • Note that functions such as fgetc() or getchar() are defined to return characters as positive integers and EOF as a distinct negative value.

    If the end-of-file indicator for the input stream pointed to by stream is not set and a next character is present, the fgetc function obtains that character as an unsigned char converted to an int.

    If the end-of-file indicator for the stream is set, or if the stream is at end-of-file, the end-of-file indicator for the stream is set and the fgetc function returns EOF. Otherwise, the fgetc function returns the next character from the input stream pointed to by stream. If a read error occurs, the error indicator for the stream is set and the fgetc function returns EOF. 289)

    289) An end-of-file and a read error can be distinguished by use of the feof and ferror functions.

    This indicates how EOF is separate from any valid character in the context of I/O operations.
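
To make the points above about char versus EOF concrete, here is a minimal sketch, not taken from the original code. The helper names count_bytes_via_char and count_bytes_via_int are hypothetical, and the first one assumes the common case where EOF is -1 and converting 0xFF to signed char yields -1.

```cpp
#include <cassert>
#include <cstdio>

// Buggy pattern: the result of fgetc() is squeezed into a (signed) char
// before being compared with EOF. Assumption: EOF == -1 and 0xFF converts
// to -1, as on typical platforms, so a 0xFF data byte looks like EOF.
std::size_t count_bytes_via_char(std::FILE* fp) {
    assert(fp != nullptr);
    std::size_t n = 0;
    for (;;) {
        signed char c = static_cast<signed char>(std::fgetc(fp));  // information lost here
        if (c == EOF) break;  // true at real end-of-file, but also at a 0xFF byte
        ++n;
    }
    return n;
}

// Correct pattern: keep the result in an int. Data bytes arrive as
// 0..UCHAR_MAX, so none of them can compare equal to the negative EOF.
std::size_t count_bytes_via_int(std::FILE* fp) {
    assert(fp != nullptr);
    std::size_t n = 0;
    int ch;
    while ((ch = std::fgetc(fp)) != EOF) {
        ++n;
    }
    return n;
}
```

On a file containing the five bytes 'a', 'b', 0xFF, 'z', 'z', the first helper would be expected to stop after two bytes under the stated assumptions, while the second counts all five and leaves feof(fp) set, because only then has a read been attempted past the end.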

You comment:

As for any potential memory leakage... At this stage in my project, memory leaks are one of many problems with my code which, as of yet, are of no concern to me. Even if it didn't leak memory, it doesn't even work to begin with, so what's the point? Functionality comes first.

It is easier to head off memory leaks in error paths at the initial coding stage than to go back later and fix them, because you may never spot them if you don't trigger the error condition. However, the extent to which that matters depends on the intended audience for the program. If it is a one-off for a coding course, you may be fine. If you're the only person who'll use it, you may be fine. But if it will be installed by millions, you'll have problems retrofitting the checks everywhere.

I have swapped _wfopen_s() with fopen() and the result from ftell() is the same. However, after changing the corresponding lines to LPSTR s = new char[fileSize + 1], RtlZeroMemory(s, sizeof(char) * fileSize + 1); (which should also null-terminate it, btw), and adding if(nPos == fileSize) to the top of the loop, it now comes out cleanly.

OK. You could use just s[fileSize] = '\0'; to null terminate the data too, but using RtlZeroMemory() achieves the same effect (but would be slower if the file is many megabytes in size). But I'm glad the various comments and suggestions helped get you back on track.
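
Pulling the suggestions together, the following is a rough, portable sketch rather than the asker's actual code. It swaps the Windows-specific calls for standard ones and uses std::string in place of new char[], so the buffer is released on every error path and is always null-terminated. The names read_whole_file and count_tokens are hypothetical, and the token counter is only a stand-in for the real lexer. It also relies on fseek() to SEEK_END working on a binary stream, which is common but not strictly guaranteed by the C standard.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Read a whole file into 'out'. Returns false on any failure.
// std::string owns the memory, so the early exits cannot leak.
bool read_whole_file(const char* path, std::string& out) {
    assert(path != nullptr);
    std::FILE* fp = std::fopen(path, "rb");
    if (fp == nullptr) return false;

    bool ok = false;
    if (std::fseek(fp, 0, SEEK_END) == 0) {
        long size = std::ftell(fp);
        if (size >= 0) {
            std::rewind(fp);
            out.resize(static_cast<std::size_t>(size));
            std::size_t got = size > 0 ? std::fread(&out[0], 1, out.size(), fp) : 0;
            ok = (got == out.size());
        }
    }
    std::fclose(fp);  // single exit path: the stream is always closed
    return ok;
}

// Stand-in for the lexer: count whitespace-separated tokens. The loop is
// bounded by the byte count (pos < size) and never compares a char to EOF.
std::size_t count_tokens(const std::string& text) {
    std::size_t tokens = 0;
    bool in_token = false;
    for (std::size_t pos = 0; pos < text.size(); ++pos) {
        unsigned char c = static_cast<unsigned char>(text[pos]);
        bool space = (c == ' ' || c == '\n' || c == '\t' || c == '\r');
        if (!space && !in_token) ++tokens;  // a new token starts here
        in_token = !space;
    }
    return tokens;
}
```

Because the loop stops on the size and not on a sentinel value, a final token with no trailing newline is still seen, and bytes such as 0xFF inside the file are treated as ordinary data.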


* In theory, CHAR_BIT might be larger than 8; in practice it is almost always 8 and, for simplicity, I'm assuming it is 8 bits here. The discussion has to be more nuanced if CHAR_BIT is 9 or more, but the net effect is much the same.


 