
Regex for multiline headers in C#

I'm new to C#. I'm trying to write a simple C# application that extracts headers from a PDF file (a book) when they are in this format:

1.1 THE ELECTRICAL/ELECTRONICS INDUSTRY

1.2 A BRIEF HISTORY

1.3 UNITS OF MEASUREMENT

I'm using the code:

string pattern = @"(\d+)(\.)(\d+) ([A-Z]+).([A-Z]+).([A-Z]+).([A-Z]+).([A-Z]+)";
Regex.Match(strText, pattern);

This works fine for single-line headers but not for headers that span two or more lines. Can anyone help?

I'm unfamiliar with C#-style regex, but isn't "." a match for any character except a newline?

If you need to match across line breaks, you'll also have to include an actual \n at the end, probably followed by a ? as well, unless you plan to add an alternative instead.

But I'm kind of surprised that this regex isn't causing any issues, unless the formatting of the book just happens to be perfect.
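To illustrate the point about ".", here is a minimal sketch of how .NET treats it with and without RegexOptions.Singleline. The class name and the sample string are made up for the demonstration.

```csharp
using System;
using System.Text.RegularExpressions;

class DotNewlineDemo
{
    static void Main()
    {
        // Illustrative input: two letters separated by a line break.
        string text = "A\nB";

        // By default, '.' matches any character except '\n'.
        Console.WriteLine(Regex.IsMatch(text, "A.B"));                           // False

        // With RegexOptions.Singleline, '.' also matches '\n'.
        Console.WriteLine(Regex.IsMatch(text, "A.B", RegexOptions.Singleline));  // True
    }
}
```

So a pattern that relies on "." to bridge words will stop at the end of the line unless Singleline is set or the line break is matched explicitly.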

I'm assuming you have already extracted the table of contents into a single string and that the only remaining problem is parsing the second-level headers.

The regular expression is modified to match only capital letters.

You can achieve the required result with the following code:

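    // \s also matches '\n', so a title may continue onto the next line.
    string pattern = @"((\d+\.\d+) ([A-Z\s]+)\n)+";
    var match = Regex.Match(input, pattern);

    // Group 1 is repeated by the outer +, so each header is one capture of it.
    var headers = new List<string>();
    foreach (Capture capture in match.Groups[1].Captures)
    {
        headers.Add(capture.Value);
    }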

After this, headers will contain all the required data.

This assumes that input holds the text to search. Also note that \n is the newline character.

Your regex, simplified:

(\d+\.\d+) followed by a space stands for the sequence "one or more digits", dot, "one or more digits", space.

([A-Z\s]+)\n - "one or more capital letters or whitespace characters", "newline character"
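One caveat: the character class [A-Z\s] does not include "/", so a header such as "1.1 THE ELECTRICAL/ELECTRONICS INDUSTRY" from the question would be skipped. Below is a rough sketch of one possible variant, not a definitive fix. It assumes that a continuation line consists only of capital letters, "/" and spaces. HeaderExtractor, ExtractHeaders and the sample text are hypothetical names and data used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class HeaderExtractor
{
    // Hypothetical helper: returns each "N.N TITLE" header as a single line.
    public static List<string> ExtractHeaders(string text)
    {
        // Assumption: normalize Windows line endings so '$' and '\n' behave predictably.
        text = text.Replace("\r\n", "\n");

        // Number, space, then one or more lines made only of capitals, '/' and spaces.
        // With RegexOptions.Multiline, '^' and '$' match at the start and end of each line.
        const string pattern = @"^(\d+\.\d+) ([A-Z][A-Z/ ]*(?:\n[A-Z][A-Z/ ]*)*)$";

        var headers = new List<string>();
        foreach (Match m in Regex.Matches(text, pattern, RegexOptions.Multiline))
        {
            // Join a wrapped title back into one line.
            string title = Regex.Replace(m.Groups[2].Value, @"\s+", " ").Trim();
            headers.Add(m.Groups[1].Value + " " + title);
        }
        return headers;
    }
}

class Program
{
    static void Main()
    {
        // Made-up sample: the second header wraps onto a second line.
        string sample =
            "1.1 THE ELECTRICAL/ELECTRONICS INDUSTRY\n" +
            "1.2 A BRIEF HISTORY OF\nELECTRICAL ENGINEERING\n" +
            "1.3 UNITS OF MEASUREMENT\n" +
            "The text starts here.\n";

        foreach (string header in HeaderExtractor.ExtractHeaders(sample))
        {
            Console.WriteLine(header);
        }
        // Prints:
        // 1.1 THE ELECTRICAL/ELECTRONICS INDUSTRY
        // 1.2 A BRIEF HISTORY OF ELECTRICAL ENGINEERING
        // 1.3 UNITS OF MEASUREMENT
    }
}
```

Using Regex.Matches gives one Match per header, so a header that fails to match does not cut the whole run short. The trade-off is that a body line written entirely in capitals directly after a header would be absorbed into the title.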

Also, read the following article to get familiar with C# regular expressions.
