Regex matching chunks of multiline text?

I have a text file that contains over 200 records of the following format:

 @INPROCEEDINGS{Rajan-Sullivan03,
  author = {Hridesh Rajan and Kevin J. Sullivan},
  title = {{{Eos}: Instance-Level Aspects for Integrated System Design}},
  booktitle = {ESEC/FSE 2003},
  year = {2003},
  pages = {297--306},
  month = sep,
  isbn = {1-58113-743-5},
  location = {Helsinki, FN},
  owner = {Administrator},
  timestamp = {2009.03.08}
}

@INPROCEEDINGS{ras-mor-models-06,
  author = {Awais Rashid and Ana Moreira},
  title = {Domain Models Are {NOT} Aspect Free},
  booktitle = {MoDELS},
  year = {2006},
  editor = {Oscar Nierstrasz and Jon Whittle and David Harel and Gianna Reggio},
  volume = {4199},
  series = {Lecture Notes in Computer Science},
  pages = {155--169},
  publisher = {Springer},
  bibdate = {2006-12-07},
  bibsource = {DBLP, http://dblp.uni-trier.de/db/conf/models/models2006.html#RashidM06},
  isbn = {3-540-45772-0},
  owner = {aljasser},
  timestamp = {2008.09.16},
  url = {http://dx.doi.org/10.1007/11880240_12}
}

Basically, a record starts with @ and ends with a }, so I tried a pattern that starts at @ and ends at }\n}. It didn't work: it matches only the first record and not the other one, because there is no newline after it.

            string pattern = @"(^@)([\s\S]*)(}$\n}(\n))";

When I tried to fix it by making the trailing newline optional, it matched everything as one match:

 string pattern = @"(^@)([\s\S]*)(}$\n}(\n*))";

I kept trying until I reached the following pattern, but it's not working. Could you please fix it, or suggest a more efficient one, along with a short explanation of how it works?

Here is my code:

string pattern = @"(^@)([\s\S]*)(}$\n}(\n))";
Regex regex = new Regex(pattern, RegexOptions.Multiline);
var matches = regex.Matches(bibFileContent).Cast<Match>().Select(m => m.Value).ToList();

If you use the Matches method, you need a pattern that handles balanced curly brackets, like this one:

string pattern = @"@[A-Z]+{(?>[^{}]+|(?<open>{)|(?<-open>}))*(?(open)(?!))}";
Regex regex = new Regex(pattern);

or, to ensure that all results are contiguous and well-formed (as far as brackets are concerned):

string pattern = @"\G[^{}]*(@[A-Z]+{(?>[^{}]+|(?<open>{)|(?<-open>}))*(?(open)(?!))})";

These two patterns use a named capture group as a counter. Each opening bracket pushes a capture onto the "open" group (the counter is incremented), and each closing bracket pops one (the counter is decremented). (?(open)(?!)) is a conditional test that makes the pattern fail if the counter isn't zero, that is, if an opening bracket is still unclosed.
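
As a rough illustration of the counter described above, here is a minimal sketch that runs the first pattern through Regex.Matches. The class name, the method name and the two-record sample are made up for this example; the second record deliberately has no trailing newline.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BalancedBraceDemo
{
    // Same pattern as above: the "open" group acts as a brace counter.
    private static readonly Regex RecordRx = new Regex(
        @"@[A-Z]+{(?>[^{}]+|(?<open>{)|(?<-open>}))*(?(open)(?!))}");

    // Returns every brace-balanced "@TYPE{...}" chunk found in the input.
    public static string[] ExtractRecords(string input)
    {
        return RecordRx.Matches(input).Cast<Match>().Select(m => m.Value).ToArray();
    }

    public static void Main()
    {
        // Hypothetical sample with nested braces in the first record.
        string sample =
            "@MISC{a,\n  title = {{{Eos}: Nested} braces}\n}\n\n" +
            "@MISC{b,\n  title = {Plain}\n}";

        foreach (string record in ExtractRecords(sample))
        {
            Console.WriteLine(record);
            Console.WriteLine("----");
        }
    }
}
```

Because the match ends at the closing bracket that brings the counter back to zero, it does not matter whether a newline follows the last record.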

If the chunks do not contain the @ character, it is handier to use the Regex.Split(input, pattern) method:

string[] result = Regex.Split(input, @"[^}]*(?=@)");

If the chunks can contain the @ character, you can make it more robust with a more descriptive lookahead:

string[] result = Regex.Split(input, @"[^}]*(?=@[A-Z]+{)");

or

string[] result = Regex.Split(input, @"\s*(?=@[A-Z]+{)");
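
A small sketch of the Split approach, using the last pattern above. Since the pattern can match an empty string right in front of a record, the result may contain empty entries, so this sketch trims each chunk and drops the empty ones. The class and method names are invented for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Splits the input in front of every "@TYPE{" header and cleans up the pieces.
    public static string[] SplitRecords(string input)
    {
        return Regex.Split(input, @"\s*(?=@[A-Z]+{)")
                    .Select(chunk => chunk.Trim())     // remove surrounding whitespace
                    .Where(chunk => chunk.Length > 0)  // drop empty entries produced by Split
                    .ToArray();
    }

    public static void Main()
    {
        // Hypothetical input; the '@' inside the e-mail address is not a split point
        // because it is not followed by upper-case letters and an opening bracket.
        string input = "@MISC{a,\n  note = {me@example.com}\n}\n\n@MISC{b,\n  note = {x}\n}\n";

        foreach (string record in SplitRecords(input))
        {
            Console.WriteLine(record);
            Console.WriteLine("----");
        }
    }
}
```

Note that Split does not check that the brackets are balanced; it only cuts the text at the record headers.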

I think the problem is that your input does not end with \n, so your second record is not matched. You should add an alternation with $.

This pattern captures each record in group 1 (without the leading @ and the closing }):

@(.*?)^}(?:[\r\n]+|$)

Note that you have to use the m and s modifiers (RegexOptions.Multiline and RegexOptions.Singleline in .NET).

Use this code:

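string pattern = @"@(.*?)^}(?:[\r\n]+|$)";
Regex regex = new Regex(pattern, RegexOptions.Multiline | RegexOptions.Singleline);
MatchCollection mc = regex.Matches(bibFileContent);
List<string> results = new List<string>();
// Iterate over every match (one per record), not over the groups of the first match.
foreach (Match m in mc)
{
    // Group 1 holds the record text between the leading @ and the closing }.
    results.Add(m.Groups[1].Value);
}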

You could use a simple regex like this:

(@[^@]+)

The idea is to match content that starts with @ and doesn't contain another @. If you just want to match the pattern instead of capturing it, remove the capturing group:

@[^@]+
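
A quick sketch of how this simple pattern behaves, under the assumption stated above that no field value contains an @. The class name, method name and sample strings are invented for the example. Each match also swallows the whitespace up to the next @, hence the Trim call.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SimplePatternDemo
{
    // Returns every run of text that starts with '@' and contains no further '@'.
    public static string[] ExtractChunks(string input)
    {
        return Regex.Matches(input, @"@[^@]+")
                    .Cast<Match>()
                    .Select(m => m.Value.Trim()) // the match includes trailing newlines
                    .ToArray();
    }

    public static void Main()
    {
        // Two records without any '@' inside the fields: two chunks.
        string clean = "@MISC{a,\n  note = {x}\n}\n\n@MISC{b,\n  note = {y}\n}\n";
        Console.WriteLine(ExtractChunks(clean).Length);

        // Caveat: an '@' inside a field (here a made-up e-mail address)
        // cuts the first record in two, giving three chunks instead of two.
        string withAt = "@MISC{a,\n  note = {me@example.com}\n}\n\n@MISC{b,\n  note = {y}\n}\n";
        Console.WriteLine(ExtractChunks(withAt).Length);
    }
}
```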

This looks like a candidate for balancing groups.

 # @"(?m)^[^\S\r\n]*@[^{}]+(?:\{(?>[^{}]+|\{(?<Depth>)|\}(?<-Depth>))*(?(Depth)(?!))\})"

 (?m)
 ^ [^\S\r\n]* 
 @ [^{}]+ 
 (?:
      \{                            # Match opening {
      (?>                           # Then either match (possessively):
           [^{}]+                        #   Anything (but only if we're not at the start of { or } )
        |                              # or
           \{                            #  { (and increase the braces counter)
           (?<Depth> )
        |                              # or
           \}                            #  } (and decrease the braces counter).
           (?<-Depth> )
      )*                            # Repeat as needed.
      (?(Depth)                     # Assert that the braces counter is at zero.
           (?!)                          # Fail if it isn't
      )
      \}                            # Then match a closing }. 
 )

Code sample

Regex FghRx = new Regex( @"(?m)^[^\S\r\n]*@[^{}]+(?:\{(?>[^{}]+|\{(?<Depth>)|\}(?<-Depth>))*(?(Depth)(?!))\})" );
string FghData =
@"
@INPROCEEDINGS{Rajan-Sullivan03,
author = {Hridesh Rajan and Kevin J. Sullivan},
  title = {{{Eos}: Instance-Level Aspects for Integrated System Design}},
  booktitle = {ESEC/FSE 2003},
  year = {2003},
  pages = {297--306},
  month = sep,
  isbn = {1-58113-743-5},
  location = {Helsinki, FN},
  owner = {Administrator},
  timestamp = {2009.03.08}
}

@INPROCEEDINGS{ras-mor-models-06,
  author = {Awais Rashid and Ana Moreira},
  title = {Domain Models Are {NOT} Aspect Free},
  booktitle = {MoDELS},
  year = {2006},
  editor = {Oscar Nierstrasz and Jon Whittle and David Harel and Gianna Reggio},
  volume = {4199},
  series = {Lecture Notes in Computer Science},
  pages = {155--169},
  publisher = {Springer},
  bibdate = {2006-12-07},
  bibsource = {DBLP, http://dblp.uni-trier.de/db/conf/models/models2006.html#RashidM06},
  isbn = {3-540-45772-0},
  owner = {aljasser},
  timestamp = {2008.09.16},
  url = {http://dx.doi.org/10.1007/11880240_12}
}
";

Match FghMatch = FghRx.Match(FghData);
while (FghMatch.Success)
{
    Console.WriteLine("New Record\n------------------------");
    Console.WriteLine("{0}", FghMatch.Groups[0].Value);
    FghMatch = FghMatch.NextMatch();
    Console.WriteLine("");
}

Output

New Record
------------------------
@INPROCEEDINGS{Rajan-Sullivan03,
author = {Hridesh Rajan and Kevin J. Sullivan},
  title = {{{Eos}: Instance-Level Aspects for Integrated System Design}},
  booktitle = {ESEC/FSE 2003},
  year = {2003},
  pages = {297--306},
  month = sep,
  isbn = {1-58113-743-5},
  location = {Helsinki, FN},
  owner = {Administrator},
  timestamp = {2009.03.08}
}

New Record
------------------------
@INPROCEEDINGS{ras-mor-models-06,
  author = {Awais Rashid and Ana Moreira},
  title = {Domain Models Are {NOT} Aspect Free},
  booktitle = {MoDELS},
  year = {2006},
  editor = {Oscar Nierstrasz and Jon Whittle and David Harel and Gianna Reggio},
  volume = {4199},
  series = {Lecture Notes in Computer Science},
  pages = {155--169},
  publisher = {Springer},
  bibdate = {2006-12-07},
  bibsource = {DBLP, http://dblp.uni-trier.de/db/conf/models/models2006.html#RashidM06},
  isbn = {3-540-45772-0},
  owner = {aljasser},
  timestamp = {2008.09.16},
  url = {http://dx.doi.org/10.1007/11880240_12}
}
