
Parsing a text file in C# while skipping some contents

I'm trying to parse a text file that has a heading and a body. The heading contains line-number references to sections of the body. For example:

SECTION_A 256
SECTION_B 344
SECTION_C 556

This means that SECTION_A starts at line 256.

What would be the best way to parse this heading into a dictionary and then read the sections when necessary?

Typical scenarios would be:

  1. Parse the header and read only section SECTION_B.
  2. Parse the header and read the first paragraph of each section.

The data file is quite large and I definitely don't want to load all of it into memory and then operate on it.

I'd appreciate your suggestions. My environment is VS 2008 and C# 3.5 SP1.

You can do this quite easily.

There are three parts to the problem.

1) How to find where a line in the file starts. The only way to do this is to read through the file, keeping a list that records the start position of each line. For example:

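List<long> lineMap = new List<long>();
lineMap.Add(0);    // Line 0: dummy entry so that list indices match 1-based line numbers
lineMap.Add(0);    // Line 1 starts at location 0 in the data file

// A StreamReader reads ahead into its own buffer, so sr.BaseStream.Position
// does not point at the end of the line just returned by ReadLine().
// Count bytes directly instead. This assumes every line ends in '\n'
// (true for LF and CRLF files in ASCII or UTF-8).
using (FileStream fs = File.OpenRead("DataFile.txt"))
{
    long position = 0;
    int b;
    while ((b = fs.ReadByte()) != -1)
    {
        position++;
        if (b == '\n')
            lineMap.Add(position);    // The next line starts right after the newline
    }
}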

2) Read and parse your index file into a dictionary.

Dictionary<string, int> index = new Dictionary<string, int>();

using (StreamReader sr = new StreamReader("IndexFile.txt")) 
{
    String line;
    while ((line = sr.ReadLine()) != null)
    {
        string[] parts = line.Split(' ');  // Break the line into the name & line number
        index.Add(parts[0], Convert.ToInt32(parts[1]));
    }
}

3) Then, to find a line in your file, use:

int lineNumber = index["SECTION_B"];          // Convert section name into the line number
long offsetInDataFile = lineMap[lineNumber];  // Convert line number into file offset

Then open a new FileStream on DataFile.txt, call Seek(offsetInDataFile, SeekOrigin.Begin) to move to the start of the line, and use a StreamReader (as above) to read line(s) from it.
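The seek-and-read step described above could look roughly like the following. This is only a sketch: the class name SectionReader, the method name ReadLinesAt and the maxLines parameter are made up for illustration, and it assumes the offset points at the first byte of a line.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class SectionReader
{
    // Reads up to maxLines lines starting at the given byte offset.
    // Assumption: offset is the position of the first byte of a line.
    public static List<string> ReadLinesAt(string path, long offset, int maxLines)
    {
        List<string> lines = new List<string>();
        using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(offset, SeekOrigin.Begin);
            // Create the reader only after seeking, so its buffer starts at the offset.
            using (StreamReader sr = new StreamReader(fs))
            {
                string line;
                while (lines.Count < maxLines && (line = sr.ReadLine()) != null)
                    lines.Add(line);
            }
        }
        return lines;
    }
}
```

With the lineMap and index structures from above, SECTION_B could then be read with something like SectionReader.ReadLinesAt("DataFile.txt", lineMap[index["SECTION_B"]], index["SECTION_C"] - index["SECTION_B"]).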

Well, obviously you can store the name + line number in a dictionary, but that alone isn't going to do you much good.

Sure, it will tell you which line to start reading from, but the problem is: where in the file is that line? The only way to know is to start from the beginning and count.

The best way would be to write a wrapper that decodes the text contents (if you have encoding issues) and gives you a mapping from line number to byte position. Then you could take that line number, 256, look it up in a dictionary to learn that line 256 starts at, say, position 10000 in the file, and start reading from there.

Is this a one-off processing situation? If not, have you considered stuffing the entire file into a local database, such as SQLite? That would give you a direct mapping between a line number and its contents. Of course, that file would be even bigger than your original file, and you'd need to copy the data from the text file into the database, so there's some overhead either way.

Just read the file one line at a time and ignore the data until you get to the lines you need. You won't have any memory issues, but performance probably won't be great. You can easily do this in a background thread, though.
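A minimal sketch of that sequential approach, assuming 1-based line numbers as in the header. The names SequentialReader and ReadLineRange are hypothetical. Only one line is held in memory at a time, and reading stops as soon as the requested range has been passed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class SequentialReader
{
    // Yields lines firstLine..lastLine (1-based, inclusive) and discards
    // everything before that range.
    public static IEnumerable<string> ReadLineRange(string path, int firstLine, int lastLine)
    {
        using (StreamReader sr = new StreamReader(path))
        {
            string line;
            int lineNumber = 0;
            while ((line = sr.ReadLine()) != null)
            {
                lineNumber++;
                if (lineNumber < firstLine) continue;    // still before the range: skip
                if (lineNumber > lastLine) yield break;  // past the range: stop early
                yield return line;
            }
        }
    }
}
```

Because the method is an iterator, the file is opened only when the caller starts enumerating and is closed when the enumeration finishes or is abandoned by a foreach loop.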

Read the file until the end of the header, assuming you know where that is. Split the strings you've stored on whitespace, like so:

Dictionary<string, int> sectionIndex = new Dictionary<string, int>();
List<string> headers = new List<string>(); // fill these with readline

foreach(string header in headers) {
    var s = header.Split(new[]{' '});
    sectionIndex.Add(s[0], Int32.Parse(s[1]));
}

Find the dictionary entry you want, keep a count of the number of lines read from the file, and loop until you hit that line number. Then read until you reach the next section's starting line. The order of keys in a Dictionary is not guaranteed, so you'd probably need both the current and the next section's names.
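One way to avoid depending on key order is to look for the smallest start line that is greater than the current section's start. The sketch below assumes a dictionary shaped like sectionIndex above; SectionBounds and NextSectionStart are illustrative names, not part of any library.

```csharp
using System;
using System.Collections.Generic;

static class SectionBounds
{
    // Returns the line on which the section following `name` starts,
    // or int.MaxValue when `name` is the last section in the file.
    // Throws KeyNotFoundException if `name` is not in the index.
    public static int NextSectionStart(Dictionary<string, int> sectionIndex, string name)
    {
        int start = sectionIndex[name];
        int next = int.MaxValue;
        foreach (KeyValuePair<string, int> entry in sectionIndex)
        {
            // The end boundary is the smallest start line greater than ours.
            if (entry.Value > start && entry.Value < next)
                next = entry.Value;
        }
        return next;
    }
}
```

The section then spans the lines from sectionIndex[name] up to, but not including, the returned value.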

Be sure to do some error checking to make sure the section you're reading to isn't before the section you're reading from, and handle any other error cases you can think of.

You could read line by line until all the heading information is captured and then stop (assuming all section pointers are in the heading). You would then have the section names and line numbers for use in retrieving the data at a later time.

try
{
    // "using" closes the reader even if an exception is thrown
    using (TextReader tr = new StreamReader("filename.txt"))
    {
        string dataRow;
        // Stop at end of file or at the first line that is not a header entry
        while ((dataRow = tr.ReadLine()) != null && dataRow.StartsWith("SECTION_"))
        {
            // Parse line for section code and line number and log values
        }
    }
}
catch (Exception ex)
{
    MessageBox.Show(ex.Message);
}
