Ifstream reads wrong characters from text file

I have the following simple code, which reads the contents of a text file into an array of chars:

const char* name = "test.txt";
std::cout << "Loading file " << name << std::endl;
std::ifstream file;
file.open(name);
file.seekg (0, std::ios::end);
int length = file.tellg();
std::cout << "Size: " << length << " bytes" << std::endl;
file.seekg (0, std::ios::beg);
char* buffer = new char[length];
file.read(buffer,length);
file.close();
std::cout.write(buffer,length);

However, ifstream seems to read the wrong number of chars from the file: one additional char for each line. I searched the web, and it looks like text files on Windows 7 have a carriage return (\r) in addition to the newline (\n) at the end of each line. However, the stream somehow does not see these \r characters, but still uses the original number of symbols in the file, reading additional bytes from beyond the end of the file. Is there a way to solve this problem?

If it helps: I use the MinGW compiler on 64-bit Windows 7.

You might want to open the file in binary mode:

file.open(name, std::ios_base::in | std::ios_base::binary);

Otherwise the standard library translates every Windows newline (CR+LF) into a single \n for you.

This means that the number of characters you can read from the file is not the same as the size of the file. When you call read(), it reads as many characters as it can. If it can't read the number of characters you requested, it sets the stream's failbit (and eofbit, because it ran into the end of the file).
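To make the point about short reads concrete, here is a minimal sketch that keeps the text-mode open from the question but trusts gcount() rather than the requested length. The helper name read_file_text is made up for illustration, and the sketch assumes that tellg() at the end of the file yields a value usable as an upper bound on the character count, which holds on Unix and Windows but is not guaranteed by the standard.

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>

// Hypothetical helper (not from the original post): reads a file in text
// mode and keeps only the characters that read() actually delivered.
std::string read_file_text(const std::string& path)
{
    std::ifstream file(path);  // text mode: CR+LF may be translated to '\n'
    if (!file) {
        return std::string();
    }

    // Assumption: the position at the end of the file converts to an
    // upper bound on the number of chars we can read.
    file.seekg(0, std::ios::end);
    const std::streamoff size = file.tellg();
    file.seekg(0, std::ios::beg);
    if (size <= 0) {
        return std::string();
    }

    std::string buffer(static_cast<std::size_t>(size), '\0');
    file.read(&buffer[0], size);

    // gcount() reports how many chars the last read() stored. After a
    // short read the stream has failbit and eofbit set, but the chars
    // that were delivered are still valid.
    buffer.resize(static_cast<std::size_t>(file.gcount()));
    return buffer;
}
```

Printing the returned string instead of calling std::cout.write(buffer, length) avoids emitting the unused tail of the buffer.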

Read up on opening a file for binary reading (Google it, or see here).
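For reference, a binary-mode version of the question's code might look like the sketch below. The helper name read_file_binary is hypothetical, and the code assumes that the end-of-file position converts to a byte count, which is true on common Unix and Windows implementations but is not something the standard promises.

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): returns the raw bytes
// of a file, with no newline translation applied.
std::vector<char> read_file_binary(const std::string& path)
{
    // std::ios::binary disables newline translation, so CR+LF stays as
    // two separate bytes in the buffer.
    std::ifstream file(path, std::ios::in | std::ios::binary);
    if (!file) {
        throw std::runtime_error("cannot open " + path);
    }

    // Assumption: the end position converts to the file size in bytes.
    file.seekg(0, std::ios::end);
    const std::streamoff size = file.tellg();
    if (size < 0) {
        throw std::runtime_error("cannot determine size of " + path);
    }
    file.seekg(0, std::ios::beg);

    std::vector<char> buffer(static_cast<std::size_t>(size));
    file.read(buffer.data(), size);

    // Keep only what was actually read, in case the read came up short.
    buffer.resize(static_cast<std::size_t>(file.gcount()));
    return buffer;
}
```

A std::vector is used here instead of new char[length] so the buffer is released automatically; that is a design choice of this sketch, not part of the original answer.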

You're starting from some very erroneous (but widespread) assumptions. file.tellg() doesn't return an int; it returns an implementation-defined object of type streampos, which must be a class type, and may or may not be convertible to an integral type. And even if it is convertible to an integral type (I don't know of an implementation where it isn't, although this is not required), there is no guarantee that the resulting integer represents anything more than a magic cookie that allows seeking back to the same position.
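As a small illustration of the types involved, the sketch below stores the result of tellg() in a std::streampos and converts it to std::streamoff explicitly. The function name end_offset is invented for this example, and treating the converted value as a byte offset is an assumption that happens to hold on mainstream implementations, not a guarantee.

```cpp
#include <fstream>
#include <ios>
#include <string>
#include <type_traits>

// std::streampos is a class type (a specialization of std::fpos), not int.
static_assert(std::is_class<std::streampos>::value,
              "streampos is a class type");
static_assert(std::is_same<std::ifstream::pos_type, std::streampos>::value,
              "tellg() on an ifstream returns streampos");

// Hypothetical helper: returns the end-of-file position converted to an
// integral offset, or -1 if the file cannot be opened.
std::streamoff end_offset(const std::string& path)
{
    std::ifstream file(path, std::ios::binary);
    if (!file) {
        return -1;
    }
    file.seekg(0, std::ios::end);
    const std::streampos pos = file.tellg();

    // Assumption: the converted value is a byte offset from the start of
    // the file. The standard only promises it can be used to seek back.
    return static_cast<std::streamoff>(pos);
}
```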

In practice, this is probably not a big issue on modern machines: both Unix and Windows return the offset in bytes from the start of the file. In the case of Unix, this works fine, because the mapping of the internal representation to the external one is one to one. In the case of Windows, there is a remapping of line endings: in a text file, a line ending is the two-byte sequence 0x0D, 0x0A, which becomes, when read, the single char '\n'. And streampos (converted to an integral type) gives the offset in bytes to where you have to seek in the file, not the number of chars you have to read to get to that position. For things like what you seem to be doing, this is not a problem; the allocated buffer may be a little larger than necessary, but it will never be too small.

Be aware that this may not be true on mainframes. Historically, at least, mainframes used block-oriented files, and the integral value of a streampos could easily be something broken up into fields, with a certain number of bits for the block number and other bits for the byte offset within the block. Depending on how these are laid out in the word, a buffer allocated as you do could easily be several orders of magnitude too big, or, if the offset is placed in the high-order bits, too small.

The only reliable way of getting the exact size of the buffer you need is system dependent, and on some systems (including Windows), there may be no other way except by reading all of the characters and counting them.
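If the exact character count matters, one portable way to "read all of the characters and count them" is to let a string grow as the stream is consumed. The following is only a sketch of that idea; the name slurp is made up for illustration.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: reads every char the stream delivers, without
// asking the stream for its size first. Returns an empty string if the
// file cannot be opened.
std::string slurp(const std::string& path)
{
    std::ifstream file(path);  // text mode, so line endings are translated
    return std::string(std::istreambuf_iterator<char>(file),
                       std::istreambuf_iterator<char>());
}
```

After the call, size() on the returned string is the number of chars actually read, so no seekg() or tellg() arithmetic is needed.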

(The reason streampos is required to be a class type is that, historically, many older multibyte encodings had an encoding state; you couldn't correctly decode a character without knowing what characters preceded it. So streampos is required to contain two different pieces of information: the position to seek to in the file, and information about this state. I don't think that there are any state-dependent multibyte encodings in wide use today, however.)


 