
C# : parsing text file

I have a text file whose content looks something like this:

idiom: meaning
description.
o example1.
o example2.

idiom: meaning
description.
o example1.
o example2.

.
.
.

As you can see, the file contains paragraphs like the ones above, and each paragraph holds some data that I want to extract (note that examples start with "o "). For example, I want to fill these classes:

public class Idiom
{
    public string Phrase { get; set; }   // renamed from Idiom: a C# member cannot share the name of its enclosing type
    public string Meaning { get; set; }
    public string Description { get; set; }
    public IList<IdiomExample> IdiomExamples { get; set; }
}

public class IdiomExample
{
    public string Item { get; set; }
}

Is there any way to extract those fields from the file? Any ideas?

Edited
The file could contain anything; "idiom", "verb" and so on are only placeholders, and the layout above is just my pattern. For example:

little by little: gradually, slowly (also: step by step)
o Karen's health seems to be improving little by little.
o If you study regularly each day, step by step your vocabulary will increase.
to tire out: to make very weary due to difficult conditions or hard effort (also: to wear out) (S)
o The hot weather tired out the runners in the marathon.
o Does studying for final exams wear you out? It makes me feel worn out!

Thanks in advance

Here is my regex for your problem:

(?<section>(?<idiom>^.+?):(?<meaning>.+)[\n](?<description>.*?)(?<examples>(?<example>o.+[\s\r\n])+))

I tested it a little, but I think you'll have to fix a few small problems. In general, it works well.

Settings for this regex:

RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace | RegexOptions.CultureInvariant
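As a rough illustration of how a named-group pattern can be turned into objects, here is a minimal sketch. It does not use the exact pattern above: the pattern, the `ParsedIdiom` type and the `IdiomRegexParser` class are hypothetical, and the sketch assumes that every example line starts with "o ", that the description (if any) is a single line, and that no idiom itself starts with "o ".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical result type for this sketch (not the classes from the question).
public class ParsedIdiom
{
    public string Phrase { get; set; }
    public string Meaning { get; set; }
    public string Description { get; set; }   // null when the entry has no description line
    public List<string> Examples { get; } = new List<string>();
}

public static class IdiomRegexParser
{
    // Assumed pattern: header line, optional one-line description, one or more "o " lines.
    private static readonly Regex EntryPattern = new Regex(
        @"^(?!o )(?<idiom>[^:\r\n]+):[ \t]*(?<meaning>[^\r\n]*)\r?\n" +
        @"(?:(?!o )(?<description>[^\r\n]+)\r?\n)?" +
        @"(?:o (?<example>[^\r\n]*)(?:\r?\n|\z))+",
        RegexOptions.Multiline | RegexOptions.ExplicitCapture | RegexOptions.CultureInvariant);

    public static List<ParsedIdiom> Parse(string text)
    {
        var entries = new List<ParsedIdiom>();
        foreach (Match m in EntryPattern.Matches(text))
        {
            var entry = new ParsedIdiom
            {
                Phrase = m.Groups["idiom"].Value.Trim(),
                Meaning = m.Groups["meaning"].Value.Trim(),
                Description = m.Groups["description"].Success
                    ? m.Groups["description"].Value.Trim()
                    : null
            };
            // A repeated named group keeps every repetition in its Captures collection.
            foreach (Capture c in m.Groups["example"].Captures)
                entry.Examples.Add(c.Value.Trim());
            entries.Add(entry);
        }
        return entries;
    }
}
```

Calling `IdiomRegexParser.Parse(File.ReadAllText("idioms.txt"))` (the file name is only a placeholder) would return one `ParsedIdiom` per paragraph. Entries that do not fit the assumed layout are silently skipped, so it is worth checking the number of results against the file.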

You have three ways to work with your file. The first is to use a regex: it is the quickest solution to develop and the slowest in performance. The second is to read the text into strings and parse it by hand, with LINQ or whatever you prefer. To me this approach is bug-prone and does not scale well, but it performs better, which can be critical if you deal with very large files. The third is to use a formal grammar and a finite-state machine, or something similar. I have never implemented anything like that, but I know it is fast and also hard to develop and maintain. So I recommend starting with regexes and migrating to another approach only if performance becomes your bottleneck.

Hope this helps!

Your example has no description, but this regexp accepts an optional description. It gives you an idea of how to parse your input; it is not the complete C# code.

See this demo and look at the groups:

(?smx)
^ 
([^:\n]+):\s*([^\n]+)
\n([^o].*?\n|)
(^o.*?)
(?=\Z|^[^o:\n]+:)

After this:

  1. Group#1 has idiom

  2. Group#2 has meaning

  3. Group#3 has description if present

  4. Group#4 has all the examples

This regex does not split the examples into separate items; that is the next job. You may also want to trim some of the captured newlines.
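That next job could look roughly like the sketch below. `ExampleSplitter` and `SplitExamples` are hypothetical names, and the sketch assumes each example sits on a single line that starts with "o "; lines that do not start that way are dropped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ExampleSplitter
{
    // Hypothetical helper: takes the text captured by Group#4 and
    // returns one string per example, without the leading "o " marker.
    public static List<string> SplitExamples(string examplesBlock)
    {
        return examplesBlock
            .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(line => line.Trim())
            .Where(line => line.StartsWith("o ", StringComparison.Ordinal))
            .Select(line => line.Substring(2).Trim())
            .ToList();
    }
}
```

With a `Match m` from the regex above, the call would be `ExampleSplitter.SplitExamples(m.Groups[4].Value)`.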

Something along these lines (I haven't tested it, this is just a suggestion):

Regex r = new Regex(@"Idiom:([^\n]+)\n([^o]+)(o([^o]+)o)*");

Something like this should work. As I said, it is untested, but with a little debugging I guess it would work.

I know you tagged the question with regex, but you can also extract the data line by line:

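using System.Collections.Generic;
using System.IO;

using ( var textReader = new StreamReader("idioms.txt") )
{
    var idioms = new List<Idiom>();
    Idiom current = null;   // the entry currently being filled
    string line;
    while ( ( line = textReader.ReadLine() ) != null )
    {
        if ( line.Trim().Length == 0 )
            continue;   // blank lines only separate entries

        if ( line.StartsWith("o ") )
        {
            // example line: belongs to the current entry
            if ( current != null )
                current.IdiomExamples.Add(new IdiomExample { Item = line.Substring(2).Trim() });
        }
        else if ( current == null || current.IdiomExamples.Count > 0 )
        {
            // first line of a new entry: "idiom: meaning"
            int colon = line.IndexOf(':');
            if ( colon < 0 )
                continue;   // not a header line, skip it

            current = new Idiom
            {
                // Phrase is an assumed property name for the idiom text;
                // a C# member cannot be called Idiom inside class Idiom.
                Phrase = line.Substring(0, colon).Trim(),
                Meaning = line.Substring(colon + 1).Trim(),
                IdiomExamples = new List<IdiomExample>()
            };
            idioms.Add(current);
        }
        else
        {
            // line between the header and the first example: the description
            current.Description = line.Trim();
        }
    }

    // idioms ready
}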

