
Regex parsing of a multi-line entry with optional newline characters

I am close to a novice with regex. I am trying to parse the output of a command-line interface (CLI). The output is typically a listing of the files and folders under a specified path, and it can come in either of the following formats.

CLI Output format 1

d:\ARCTest\_MyProject\Sources\CMCore\project.pj subsandbox <CRLF>
<space> d:\ARCTest\_MyProject\Sources\CMInterfaces\project.pj subsandbox <CRLF>
<space> d:\ARCTest\_MyProject\Sources\CMImplementation\project.pj subsandbox <CRLF>
<space> d:\ARCTest\_MyProject\Sources\Übersicht und fragen\project.pj subsandbox <CRLF>
<space> d:\ARCTest\_MyProject\Sources\CMAccess.sln archived 1.15 <CRLF>
<space> d:\ARCTest\_MyProject\Sources\übersicht und fragen.xlsx archived 1.1

In format 1, every line from the second onward is preceded by a CRLF and an additional space (I have marked these with the symbols <space> and <CRLF>; they are not part of the actual output). The CRLF is optional and might not always be present. The first four entries are paths to MKS folders and the last two are files in MKS. I want all matches that point to folders (those ending in \project.pj, including the project.pj itself) and all matches that point to files (those followed by the word archived, excluding the text archived).

CLI Output format 2

CMCore/project.pj subproject <CRLF>
CMInterfaces/project.pj subproject <CRLF>
CMImplementation/project.pj subproject <CRLF>
Übersicht und fragen/project.pj subproject <CRLF>
CMAccess.sln archived <CRLF>
übersicht und frögen.xlsx archived

In format 2, every line from the second onward is preceded by a CRLF (marked with the symbol <CRLF>; it is not part of the actual output). The first four entries are paths to MKS folders and the last two are files in MKS. I want all matches that point to folders (those ending in /project.pj, including the project.pj itself) and all matches that point to files (those followed by the word archived, excluding the text archived).

I was almost able to parse the folders in both cases using the regular expression ^([^\r\n]\w+.+?\.pj), but it failed to fetch the first line of output format 1. I could not figure out how to parse the files in either format. Any solution would be of great help.

Please let me know if I need to provide more information on this.

Thanks in advance, Joe.

Try this:

([\w ]\S+\/*)*\w([\w]+\.(\w+))

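using System;
using System.Text.RegularExpressions;

public class Test
{
    public static void Main()
    {
        // Same pattern as above, with the extension fixed to "pj" so that
        // only entries containing a project.pj file count as folders.
        string patternDir = @"([\w ]\S+\/*)*\w([\w]+\.(pj))";

        // One folder entry (format 1) and one file entry (format 2).
        string pathDir = @"d:\ARCTest\_MyProject\Sources\CMInterfaces\project.pj subsandbox ";
        string pathFile = @"CMAccess.sln archived";

        // IsMatch is unanchored: it only checks that "<name>.pj" occurs somewhere in the line.
        Console.WriteLine(Regex.IsMatch(pathDir, patternDir) ? "It's dir!" : "It's not a dir");
        Console.WriteLine(Regex.IsMatch(pathFile, patternDir) ? "It's dir!" : "It's not a dir");

        // Keep the console window open until a key is pressed.
        Console.ReadKey();
    }
}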
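
The snippet above only classifies one line at a time. If the goal is to pull every folder and file path out of the whole output, including the case where the CRLF is missing, a single pattern with named groups is one way to sketch it. This is only a rough sketch under assumptions taken from the samples in the question, not from any MKS documentation: every entry is assumed to be followed by one of the keywords subsandbox, subproject or archived, optionally followed by a revision such as 1.15. The names MksListing, Split, path, kind and rev are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MksListing
{
    // Assumed entry layout (from the question's samples):
    //   <path> <subsandbox|subproject|archived> [<revision>]
    // The path is matched lazily, so it stops at the first keyword.
    // "." never matches \n, so a path cannot run across a CRLF.
    private static readonly Regex Entry = new Regex(
        @"(?<path>\S.*?)[ \t]+(?<kind>subsandbox|subproject|archived)\b(?:[ \t]+(?<rev>\d+(?:\.\d+)+)(?!\S))?",
        RegexOptions.CultureInvariant);

    // Appends folder paths (including project.pj) to 'folders' and
    // file paths (without the word "archived") to 'files'.
    public static void Split(string output, List<string> folders, List<string> files)
    {
        foreach (Match m in Entry.Matches(output))
        {
            string path = m.Groups["path"].Value;
            if (m.Groups["kind"].Value == "archived")
                files.Add(path);
            else
                folders.Add(path);
        }
    }

    public static void Main()
    {
        // Format 1: entries separated by CRLF + space; the CRLF before the
        // last entry is left out on purpose to mimic the optional case.
        string format1 =
            @"d:\ARCTest\_MyProject\Sources\CMCore\project.pj subsandbox" + "\r\n " +
            @"d:\ARCTest\_MyProject\Sources\Übersicht und fragen\project.pj subsandbox" + "\r\n " +
            @"d:\ARCTest\_MyProject\Sources\CMAccess.sln archived 1.15" + " " +
            @"d:\ARCTest\_MyProject\Sources\übersicht und fragen.xlsx archived 1.1";

        // Format 2: relative paths separated by CRLF only.
        string format2 = string.Join("\r\n", new[]
        {
            "CMCore/project.pj subproject",
            "Übersicht und fragen/project.pj subproject",
            "CMAccess.sln archived",
            "übersicht und frögen.xlsx archived"
        });

        foreach (string output in new[] { format1, format2 })
        {
            var folders = new List<string>();
            var files = new List<string>();
            Split(output, folders, files);

            foreach (string f in folders) Console.WriteLine("folder: " + f);
            foreach (string f in files) Console.WriteLine("file:   " + f);
        }
    }
}
```

The optional rev group is there so that a revision number is consumed together with its entry. Without it, a missing CRLF would cause the revision to be glued to the front of the next path. A folder or file name that itself contains one of the keywords as a separate word would confuse this pattern, so treat it as a starting point. As a side note, the pattern in the question presumably misses the first line of format 1 because, after the first character d, it requires a word character (\w+) and the next character is a colon. The later lines only match because they start with a space.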
