C#: parsing a text file

I have a text file whose content looks something like this:

idiom: meaning
description.
o example1.
o example2.

idiom: meaning
description.
o example1.
o example2.

.
.
.

As you can see, the file contains paragraphs like the ones above, and each paragraph has some data that I want to extract (note that the examples start with o). For example, I want to load the data into these classes:

public class Idiom
{
    // Renamed from "Idiom": a C# member cannot have the same name as its enclosing type (CS0542).
    public string Name { get; set; }
    public string Meaning { get; set; }
    public string Description { get; set; }
    public IList<IdiomExample> IdiomExamples { get; set; }
}

public class IdiomExample
{
    public string Item { get; set; }
}

Is there any way to extract those fields from the file? Any ideas?

Edit:
The file could contain anything; idioms, verbs, and so on are only examples. The layout above is just my pattern. For example:

little by little: gradually, slowly (also: step by step)
o Karen's health seems to be improving little by little.
o If you study regularly each day, step by step your vocabulary will increase.
to tire out: to make very weary due to difficult conditions or hard effort (also: to wear out) (S)
o The hot weather tired out the runners in the marathon.
o Does studying for final exams wear you out? It makes me feel worn out!

Thanks in advance.

Here is my regex for your problem:

(?<section>(?<idiom>^.+?):(?<meaning>.+)[\n](?<description>.*?)(?<examples>(?<example>o.+[\s\r\n])+))

I tested it a little, but I think you will have to fix a few small problems. In general, it works well.

Settings for this regex:

RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace | RegexOptions.CultureInvariant
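To show how the named groups and options above might be consumed from C#, here is a minimal sketch. The pattern and the options are the ones given in this answer; the class name `IdiomRegexDemo`, the method `Parse`, and the tuple it returns are assumptions made up for illustration. I have only reasoned about it on input shaped like the question's second sample (a header line followed directly by example lines, with a newline after the last example), so treat it as a starting point rather than a finished parser.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class IdiomRegexDemo
{
    // Pattern and options copied from the answer above.
    private static readonly Regex SectionRegex = new Regex(
        @"(?<section>(?<idiom>^.+?):(?<meaning>.+)[\n](?<description>.*?)(?<examples>(?<example>o.+[\s\r\n])+))",
        RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.ExplicitCapture |
        RegexOptions.IgnorePatternWhitespace | RegexOptions.CultureInvariant);

    // Returns one (Idiom, Meaning, Examples) tuple per regex match.
    public static List<(string Idiom, string Meaning, List<string> Examples)> Parse(string text)
    {
        var result = new List<(string Idiom, string Meaning, List<string> Examples)>();

        // Normalize Windows line endings so "." does not pick up a trailing '\r'.
        text = text.Replace("\r\n", "\n");

        foreach (Match m in SectionRegex.Matches(text))
        {
            // A repeated group keeps every iteration in its Captures collection.
            var examples = m.Groups["example"].Captures
                .Cast<Capture>()
                .Select(c => c.Value.Substring(1).Trim())   // drop the leading "o" marker
                .ToList();

            result.Add((m.Groups["idiom"].Value.Trim(),
                        m.Groups["meaning"].Value.Trim(),
                        examples));
        }
        return result;
    }
}
```

A caller would read the file with `File.ReadAllText` and pass the string to `IdiomRegexDemo.Parse`. The key point is that in .NET a quantified group such as `(?<example>...)+` exposes all of its iterations through `Group.Captures`, so the individual examples do not need a second regex.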

Well, you have three ways to work with your file. The first is to use a regex: it is the quickest to develop and the slowest to run. The second is to read your text into strings and process them with LINQ or whatever you like. In my experience this approach is bug-prone and does not scale well, but it performs better, which can be critical if you deal with very large files. The third is to use formal grammars and finite-state machines, or something like that. I have never implemented such a thing, but I know that it is fast and very hard to develop and maintain. So I recommend that you start with regexes and migrate to another approach only if performance becomes your bottleneck.

Hope this helps!

Your example has no description, but this regex accepts an optional description. It is meant to give you an idea of how to parse your input; it is not the complete C# code.

See this demo and look at the groups.

(?smx)
^ 
([^:\n]+):\s*([^\n]+)
\n([^o].*?\n|)
(^o.*?)
(?=\Z|^(?!o\s)[^:\n]+:)

After this:

  1. Group #1 has the idiom

  2. Group #2 has the meaning

  3. Group #3 has the description, if present

  4. Group #4 has all the examples

This regex does not split the examples into individual items; that is the next job. You may also not like some of the newlines left in the groups.
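The remaining step, splitting the examples block from group #4 into separate items, could be sketched like this. The class `ExampleSplitter` and its `Split` method are hypothetical names introduced for illustration, and the sketch assumes every example sits on its own line and starts with "o " as in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the original answer.
public static class ExampleSplitter
{
    // Splits a block such as "o first.\no second.\n" into its individual examples.
    public static List<string> Split(string examplesBlock)
    {
        return examplesBlock
            .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries) // one entry per non-empty line
            .Select(line => line.Trim())
            .Where(line => line.StartsWith("o ", StringComparison.Ordinal))     // keep only example lines
            .Select(line => line.Substring(2).Trim())                           // drop the "o " marker
            .ToList();
    }
}
```

With a `Match m` from the regex above, the call would be `ExampleSplitter.Split(m.Groups[4].Value)`. Splitting on both '\r' and '\n' also removes the stray newlines mentioned above.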

Something along these lines (I haven't tested it; it is only a suggestion):

Regex r = new Regex(@"Idiom:([^\n]+)\n([^o]+)(o([^o]+)o)*");

Something like this should work. I haven't tested it, but with a little debugging I guess it would work.

I know you tagged the question with regex, but here is a way to extract the data line by line as well.

// Requires: using System.Collections.Generic; using System.IO;
using ( var textReader = new StreamReader("idioms.txt") )
{
    var idioms = new List<Idiom>();
    Idiom idiom = null;
    string line;
    while ( ( line = textReader.ReadLine() ) != null )
    {
        if ( line.StartsWith("idiom: ") )
        {
            // Header line: start a new entry; the description is on the next line.
            idiom = new Idiom { IdiomExamples = new List<IdiomExample>() };
            idiom.Meaning = line.Substring("idiom: ".Length);
            idiom.Description = textReader.ReadLine();
            idioms.Add(idiom);
        }
        else if ( idiom != null && line.StartsWith("o ") )
        {
            // Example line: strip only the leading "o " marker, not every "o " in the text.
            idiom.IdiomExamples.Add(new IdiomExample { Item = line.Substring(2) });
        }
    }

    // idioms ready
}
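The loop above looks for the literal prefix "idiom: ", whereas in the edited question the part before the first colon is the idiom itself. A variant that splits each header on its first colon might look like the sketch below. The types `IdiomEntry` and `IdiomLineParser` are hypothetical, introduced only to keep the sketch self-contained. It also assumes that a description, when present, is the single non-example line directly after a header, and that every entry has either a description or at least one example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical container, used here instead of the question's classes.
public class IdiomEntry
{
    public string Name { get; set; }
    public string Meaning { get; set; }
    public string Description { get; set; }
    public List<string> Examples { get; } = new List<string>();
}

// Hypothetical line-by-line parser.
public static class IdiomLineParser
{
    public static List<IdiomEntry> Parse(TextReader reader)
    {
        var entries = new List<IdiomEntry>();
        IdiomEntry current = null;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            line = line.Trim();
            if (line.Length == 0) continue;                 // blank separator line

            if (line.StartsWith("o ", StringComparison.Ordinal))
            {
                // Example line: belongs to the entry currently being built.
                current?.Examples.Add(line.Substring(2).Trim());
                continue;
            }

            // Assumption: the first non-example line right after a header is its description.
            bool awaitingDescription = current != null
                && current.Description == null
                && current.Examples.Count == 0;
            if (awaitingDescription)
            {
                current.Description = line;
                continue;
            }

            int colon = line.IndexOf(':');
            if (colon > 0)
            {
                // Header line: text before the first colon is the idiom, the rest is the meaning.
                current = new IdiomEntry
                {
                    Name = line.Substring(0, colon).Trim(),
                    Meaning = line.Substring(colon + 1).Trim()
                };
                entries.Add(current);
            }
            // Any other line (for example a lone ".") is ignored in this sketch.
        }
        return entries;
    }
}
```

To run it against a file, wrap a `StreamReader` in a `using` block and pass it to `IdiomLineParser.Parse`. Taking a `TextReader` rather than a file name also lets the same method be fed from a `StringReader` while experimenting.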
