
FileInfo.Length != sum of all line lengths

I'm trying to make a progress bar for reading a big file. I set the progress bar's maximum value to FileInfo.Length, read each line using StreamReader.ReadLine, and add up the length of each line (with String.Length) to set the progress bar's current value.

What I noticed is that there is a difference between the file's total length and the sum of the lengths of its lines. For example:

FileInfo.Length = 25577646
Sum of all line lengths = 25510563

Why is there such a difference?

Thanks for your help!

You aren't adding the end-of-line characters. Each line ending takes from 1 to 4 bytes, depending on the encoding and on whether it is a \n, a \r or a \r\n (1 byte = UTF-8 + \n, 4 bytes = UTF-16 + \r\n).

Note that with ReadLine it isn't possible to check which end-of-line (\n, \r or \r\n) it encountered.

From the ReadLine documentation:

A line is defined as a sequence of characters followed by a line feed ("\n"), a carriage return ("\r"), or a carriage return immediately followed by a line feed ("\r\n").

Other problem: if your file is UTF-8, then the C# char length is different from the byte length: è is one char in C# (which uses UTF-16), but 2 bytes in UTF-8. You could use:

int len = Encoding.UTF8.GetByteCount(line);
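Putting the two points of this answer together, a minimal sketch could look like the following. The class and method names are hypothetical, and it assumes you know the file's encoding and that every line, including the last one, ends with the same terminator:

```csharp
using System.IO;
using System.Text;

public static class LineByteCounter
{
    // Sums the encoded size of every line plus an assumed line terminator.
    // Assumptions (not taken from the question): the text is encoded with
    // `encoding`, and every line, including the last, ends with `newline`.
    public static long CountBytes(TextReader reader, Encoding encoding, string newline)
    {
        int newlineBytes = encoding.GetByteCount(newline);
        long total = 0;
        for (var line = reader.ReadLine(); line != null; line = reader.ReadLine())
        {
            // GetByteCount measures the encoded size without allocating a byte array.
            total += encoding.GetByteCount(line) + newlineBytes;
        }
        return total;
    }
}
```

The result can still be off by a few bytes compared with FileInfo.Length, for example if the file starts with a byte order mark or its last line has no terminator.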

Two problems here:

  • string.Length gives you the number of characters in each string, whereas FileInfo.Length gives you the number of bytes. Those can be very different things, depending on the characters and the encoding used.
  • You're not including the line breaks (typically \n or \r\n), as those are removed when reading lines with TextReader.ReadLine.
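Both effects can be seen on a tiny in-memory example. This is only an illustrative sketch; the class and method names are made up for the demonstration:

```csharp
using System.IO;
using System.Text;

public static class LengthMismatchDemo
{
    // What the question computes: the sum of string.Length over the lines
    // returned by ReadLine (line terminators are already stripped).
    public static long SumOfLineLengths(string text)
    {
        long sum = 0;
        using (var reader = new StringReader(text))
        {
            for (var line = reader.ReadLine(); line != null; line = reader.ReadLine())
            {
                sum += line.Length;
            }
        }
        return sum;
    }

    // What the file system would report for the same text in a given
    // encoding (ignoring any byte order mark).
    public static long EncodedSize(string text, Encoding encoding)
    {
        return encoding.GetByteCount(text);
    }
}
```

For the text "caf\u00E9\r\nbar\r\n", SumOfLineLengths returns 7, while EncodedSize with Encoding.UTF8 returns 12: four bytes come from the two \r\n line breaks and one extra byte from the é.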

In terms of what to do about this...

  • You presumably know the file's encoding, so you could convert each line back into bytes by calling Encoding.GetBytes to account for that difference. It would be pretty wasteful to do this though.
  • If you know the line break used by the file, you could just add the relevant number of bytes for each line you read.
  • You could keep a reference to the underlying stream and use Stream.Position to detect how far through the file you've actually read. That won't necessarily be the same as the amount of data you've processed though, as the StreamReader will have a buffer. (So you may well "see" that the Stream has read all the data even though you haven't processed all the lines yet.)
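The Stream.Position idea from the last bullet might be sketched like this. The class name and the reportProgress callback are hypothetical, and the sketch assumes a seekable stream so that Position and Length are available:

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamProgressReader
{
    // Reads every line from `stream` and reports (position, total) in bytes
    // after each one. Returns the number of lines read.
    // Assumptions: `stream` is seekable and `encoding` matches the file.
    public static int ReadLines(Stream stream, Encoding encoding, Action<long, long> reportProgress)
    {
        long total = stream.Length;
        int lineCount = 0;
        // Disposing the reader also disposes the underlying stream.
        using (var reader = new StreamReader(stream, encoding))
        {
            for (var line = reader.ReadLine(); line != null; line = reader.ReadLine())
            {
                lineCount++;
                // Position counts the bytes already pulled into the reader's
                // buffer, so it can run ahead of the lines actually processed.
                reportProgress(stream.Position, total);
            }
        }
        return lineCount;
    }
}
```

A caller could pass File.OpenRead(path) as the stream and update its progress bar inside the callback. Because of the buffering, the reported position advances in buffer-sized jumps, which is usually fine for a large file.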

The last idea is probably the cleanest, IMO.
