Parsing Text Data File With Linq

I have a large text file of records, each delimited by a newline. Each record is prefixed by a two-digit number that specifies its type. Here's an example:

....

30AA ALUMINIUM ALLOY     LMELMEUSD2.00  0.35         5101020100818
40AADFALUMINIUM ALLOY USD USD 100   1       0.20000    1.00   0 100  140003
50201008180.999993  0.00  0.00  120100818
60       0F     1  222329 1.000000      0      0  -4667  -4667   4667   4667
50201008190.999986  0.00  0.00  120100819
60       0F     1  222300 1.000000      0      0  -4667  -4667   4667   4667
40AADOALUMINIUM ALLOY USD USD 100   1       0.20000    1.00   0 100  140001
50201009150.000000  0.17  0.17  120100915
60    1200C     1  101779 0.999800      0      0  -4666  -4666   4665   4665
60    1200P     1       0 0.000000      0      0      0      0      0      0
60    1225C     1   99279 0.999800     -1     -1  -4667  -4667   4665   4665
60    1225P     1       0 0.000000      0      0      0      0      0      0
60    1250C     1   96780 0.999800      0      0  -4666  -4666   4665   4665
60    1250P     1       0 0.000000      0      0      0      0      0      0
60    1275C     1   94280 0.999800     -1     -1  -4667  -4667   4665   4665
60    1275P     1       0 0.000000      0      0      0      0      0      0
60    1300C     1   91781 0.999800      0      0  -4666  -4666   4665   4665
60    1300P     1       0 0.000000

.......

The file contains a hierarchical relationship based on the two-digit prefixes. You can think of the "30" lines as containing "40" lines as their children, "40" lines as containing "50"s, and "50"s as containing "60"s. After parsing, these lines and their prefixes will map to CLR types: "30"s to "ContractGroup", "40"s to "InstrumentTypeGroup", "50"s to "ExpirationGroup", etc.
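As a small illustration of the prefix-to-type mapping described above, here is a minimal sketch. The class `RecordPrefix` and its methods `Of` and `TypeNameFor` are hypothetical names, not part of the original code. The three type names come from the question; the "60" lines are not named there, so they are left out of the table. The helper also guards against lines shorter than two characters, where `Substring(0, 2)` would throw.

```csharp
using System;
using System.Collections.Generic;

public static class RecordPrefix
{
    // Prefix-to-type names as given in the question ("60" is not named there).
    private static readonly Dictionary<string, string> TypeNames =
        new Dictionary<string, string>
        {
            { "30", "ContractGroup" },
            { "40", "InstrumentTypeGroup" },
            { "50", "ExpirationGroup" },
        };

    // Returns the two-character record prefix, or "" when the line is too short to have one.
    public static string Of(string line)
    {
        return line != null && line.Length >= 2 ? line.Substring(0, 2) : string.Empty;
    }

    // Returns the type name for a line, or null when its prefix is not in the table.
    public static string TypeNameFor(string line)
    {
        string name;
        return TypeNames.TryGetValue(Of(line), out name) ? name : null;
    }
}
```

A call such as `RecordPrefix.Of(line)` could stand in for `line.Substring(0, 2)` wherever blank or truncated lines might appear in the file.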

I'm attempting to take a functional approach to the parse, and to reduce memory consumption by loading lazily, since this file is extremely large. My first step is to create a generator that yields one line at a time, something like this:

 public static IEnumerable<string> TextFileLineEnumerator()
 {
     using (StreamReader sr = new StreamReader("BigDataFile.txt"))
     {
         while (!sr.EndOfStream)
         {
             yield return sr.ReadLine();
         }
     }
 }

This allows me to run Linq queries against the text file and process the lines as a stream.

My problem is processing this stream into its compositional collection structure. Here's a first attempt:

  var contractgroups =   from strings in TextFileLineEnumerator()
                          .SkipWhile(s => s.Substring(0, 2) != "30")
                            .Skip(1) where strings.Substring(0,2) != "30"
                              select strings;

This gives me all child lines of "30" (but unfortunately omits the "30" line itself). The query will obviously require subqueries to gather the lines and project them (via a select) into their appropriate types, with appropriate compositions (ContractGroups containing a List of InstrumentTypeGroups, etc.).
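One way to keep the "30" line together with its children, sketched under assumptions: the class `FirstGroupExample` and the method `FirstGroupWithHeader` are hypothetical names. The sketch skips ahead to the first header line, then uses the indexed overload of `Select` so that `TakeWhile` can let the header through at index 0 and stop at the next header. It makes a single pass and stays lazy, but it only returns the first group.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FirstGroupExample
{
    // Returns the first line starting with `prefix`, followed by every line
    // up to (but not including) the next line starting with `prefix`.
    public static IEnumerable<string> FirstGroupWithHeader(
        IEnumerable<string> lines, string prefix)
    {
        return lines
            // Drop everything before the first header line.
            .SkipWhile(l => !l.StartsWith(prefix, StringComparison.Ordinal))
            // Pair each line with its position so the header (index 0) can be kept.
            .Select((line, index) => new { line, index })
            // Stop as soon as a second header line appears.
            .TakeWhile(x => x.index == 0
                || !x.line.StartsWith(prefix, StringComparison.Ordinal))
            .Select(x => x.line);
    }
}
```

Called as `FirstGroupExample.FirstGroupWithHeader(TextFileLineEnumerator(), "30")`, this would yield the first "30" line and its child lines. Getting every group, not just the first, needs a partitioning step.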

This problem more than likely boils down to my lack of experience with functional programming, so if anyone has any pointers on this sort of parsing, that would be helpful. Thanks.

It's not totally clear to me exactly what you're trying to do, but I would approach this problem by first writing a PartitionLines function like this:

// Extension method: it must be declared inside a static class.
public static IEnumerable<IEnumerable<string>> PartitionLines(
    this IEnumerable<string> source,
    Func<string, string> groupMarkerSelector,
    string delimiter)
{
    List<string> currentGroup = new List<string>();

    foreach (string line in source)
    {
        var key = groupMarkerSelector(line);

        // A delimiter line closes the group collected so far and starts a new one.
        if (delimiter == key && currentGroup.Count > 0)
        {
            yield return currentGroup;
            currentGroup = new List<string>();
        }

        currentGroup.Add(line);
    }

    // Emit the final group, which has no delimiter line after it.
    if (currentGroup.Count > 0)
        yield return currentGroup;
}

(Note that my function loads one "group" at a time into memory; I assume this is OK.)

I'd then do something like this:

var line30Groups =
    TextFileLineEnumerator().
    PartitionLines(l => l.Substring(0, 2), "30");

Now you've got the lines in groups, with a new group of lines starting each time you see a "30". You could subdivide further:

var line3040Groups =
    TextFileLineEnumerator().
    PartitionLines(l => l.Substring(0, 2), "30").Select(g =>
        g.PartitionLines(l => l.Substring(0, 2), "40"));

Now you've got the lines in groups under each "30", and each group is itself an enumerable of groups, one per child "40". And so on.

This is untested and could be cleaner, but you get the picture, I hope.
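To show how the nested grouping might be projected into the types named in the question, here is a self-contained sketch. The type names come from the question, but every member (`Header`, `Details`, `Expirations`, `InstrumentTypes`) is an assumption, as are `HierarchyBuilder`, `SplitAt`, and `Build`. `SplitAt` is a simplified stand-in for a partitioning function: unlike `PartitionLines` above, it drops any lines that appear before the first header, so the first element of each group is always the header line. The fields hold raw lines only; parsing the fixed-width columns is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shapes: only the class names appear in the question.
public class ExpirationGroup
{
    public string Header;                                // raw "50" line
    public List<string> Details = new List<string>();    // raw "60" lines
}

public class InstrumentTypeGroup
{
    public string Header;                                // raw "40" line
    public List<ExpirationGroup> Expirations = new List<ExpirationGroup>();
}

public class ContractGroup
{
    public string Header;                                // raw "30" line
    public List<InstrumentTypeGroup> InstrumentTypes = new List<InstrumentTypeGroup>();
}

public static class HierarchyBuilder
{
    // Splits lines into runs that each begin at a line starting with `prefix`.
    // Lines before the first such line are dropped.
    private static IEnumerable<List<string>> SplitAt(IEnumerable<string> lines, string prefix)
    {
        List<string> current = null;
        foreach (var line in lines)
        {
            if (line.StartsWith(prefix, StringComparison.Ordinal))
            {
                if (current != null) yield return current;
                current = new List<string>();
            }
            if (current != null) current.Add(line);
        }
        if (current != null) yield return current;
    }

    // Lazily yields one ContractGroup at a time; each group's children are
    // materialized into lists when that group is produced.
    public static IEnumerable<ContractGroup> Build(IEnumerable<string> lines)
    {
        return SplitAt(lines, "30").Select(g30 => new ContractGroup
        {
            Header = g30[0],
            InstrumentTypes = SplitAt(g30.Skip(1), "40").Select(g40 => new InstrumentTypeGroup
            {
                Header = g40[0],
                Expirations = SplitAt(g40.Skip(1), "50").Select(g50 => new ExpirationGroup
                {
                    Header = g50[0],
                    Details = g50.Skip(1).ToList()
                }).ToList()
            }).ToList()
        });
    }
}
```

With the generator from the question, `HierarchyBuilder.Build(TextFileLineEnumerator())` would hold only one "30" group in memory at a time, provided the caller consumes the result with `foreach` and does not call `ToList()` on it.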
