
StreamReader.ReadLine() very strange behaviour

I have a delimited file with a few thousand lines in it, and I wrote a method to automatically detect the delimiter.

The method looks like this:

private bool TryDetermineDelimiter(FileInfo target, out char delimiter)
        {
            char[] possibleDelimiters = new char[] { ',', ';', '-', ':' };

            using (StreamReader sr = new StreamReader(target.OpenRead()))
            {
                List<int> delimiterHits = new List<int>();

                foreach (char del in possibleDelimiters)
                {
                    while (!sr.EndOfStream)
                    {
                        var line = sr.ReadLine();
                        var matches = Regex.Matches(line, $"{del}(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

                        if(matches.Count == 0)
                        {
                            sr.BaseStream.Seek(0, SeekOrigin.Begin);
                            break;
                        }

                        delimiterHits.Add(matches.Count);
                    }

                    if (delimiterHits.Any(d => d != delimiterHits[0]) || delimiterHits.Count == 0)
                    {
                        delimiterHits.Clear();
                        continue;
                    }

                    delimiter = del;
                    return true;
                }
            }

            delimiter = ',';
            return false;
        }

Something strange happens: at the 5th line, the call to sr.ReadLine() returns the 5th line with the 1st line concatenated to it.

So for example:

Delimited file:

col1; col2; col3; col4
val1; val2; val3; val4
val5; val6; val7; val8
...

The first 4 calls to StreamReader.ReadLine() return the expected lines, but the 5th call returns: val13; val14; val15; val16; col1; col2; col3; col4;

Stepping through, I can confirm that the loop never enters the if(matches.Count == 0) block; the correct number of delimiters is found on each iteration.

Unfortunately I can't post the contents of the actual file because it may get me in trouble, but I have made sure there is no fishy business with the line endings or other characters. The file is as expected.

I should also mention that this bug does not occur with comma-separated values, only with semicolons.

Change your code to this:

if (matches.Count == 0)
{
    sr.BaseStream.Seek(0, SeekOrigin.Begin);
    sr.DiscardBufferedData();
    break;
}

StreamReader reads ahead into its own internal buffer, so seeking the BaseStream alone does not change what the reader returns next. By instructing the StreamReader to discard its buffer, you make it resynchronize with the actual position of the base stream.
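The effect can be reproduced in isolation. The following is a minimal sketch, not the asker's code: the class name RewindDemo, the method ReadAfterRewind and the three-line sample data are made up for illustration, and a MemoryStream stands in for the file. It reads one line, rewinds the base stream, and reads again, with and without the DiscardBufferedData() call.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical demo class; the names and sample data are illustrative only.
public static class RewindDemo
{
    // Reads one line, rewinds the base stream, then returns the next line read.
    public static string ReadAfterRewind(bool discardBuffer)
    {
        byte[] bytes = Encoding.UTF8.GetBytes("a;b\nc;d\ne;f\n");

        using (var sr = new StreamReader(new MemoryStream(bytes)))
        {
            // Returns "a;b"; the rest of this small stream is now sitting
            // in the reader's internal buffer.
            sr.ReadLine();

            // Moves the underlying stream only, not the reader's buffer.
            sr.BaseStream.Seek(0, SeekOrigin.Begin);

            if (discardBuffer)
            {
                // Drops the buffered data so the next read comes from the stream.
                sr.DiscardBufferedData();
            }

            return sr.ReadLine();
        }
    }

    public static void Main()
    {
        Console.WriteLine(ReadAfterRewind(false)); // c;d (stale buffer is still used)
        Console.WriteLine(ReadAfterRewind(true));  // a;b (reader restarted from the top)
    }
}
```

Without the discard, the reader keeps serving buffered lines and only re-reads from position 0 once the buffer runs out, which is why the start of the file can reappear later in the output.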

Other than that, the lines returned aren't really concatenated; the reader is looping back on itself, and the DiscardBufferedData() call shown above fixes that.
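If rewinding the reader is not a requirement, one way to sidestep the buffering issue entirely is to read the lines once and test every candidate delimiter against the in-memory list. This is only a sketch under that assumption, not part of the original answer: the class name DelimiterSniffer and the Candidates field are hypothetical, the candidate list and regular expression are copied from the question, and the whole file is assumed to fit comfortably in memory (reasonable for a few thousand lines).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; same candidates and regex as the question's method.
public static class DelimiterSniffer
{
    private static readonly char[] Candidates = { ',', ';', '-', ':' };

    public static bool TryDetermineDelimiter(IReadOnlyList<string> lines, out char delimiter)
    {
        foreach (char del in Candidates)
        {
            // Matches the delimiter only when it sits outside double quotes.
            var pattern = new Regex(
                Regex.Escape(del.ToString()) + "(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

            // Count delimiter hits per line.
            List<int> hits = lines.Select(l => pattern.Matches(l).Count).ToList();

            // Accept the delimiter if every line has the same non-zero count.
            if (hits.Count > 0 && hits[0] > 0 && hits.All(h => h == hits[0]))
            {
                delimiter = del;
                return true;
            }
        }

        delimiter = ',';
        return false;
    }
}
```

It could be called as DelimiterSniffer.TryDetermineDelimiter(File.ReadAllLines(target.FullName), out char delimiter). Because no stream is shared between candidates, there is nothing to seek or discard.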
