简体   繁体   中英

What is the best way to parse out this string inside a string?

I have the following string:

 string fullString = "group = '2843360' and (team in ('TEAM1', 'TEAM2','TEAM3'))"

And I want to parse out of this string into

 string group = ParseoutGroup(fullString);  // Expect "2843360"
 string[] teams = ParseoutTeamNames(fullString); // Expect array with three items

In terms of the example of full string, I could have one or many teams listed (not always three as in above).

I have this partially working, but my code feels very hacky and not very future proof, so I wanted to see if there was a better regular expression solution here or a more elegant way to parse these values out of this full string? There could be other things added later to the string, so I want this to be as foolproof as possible.

In the simplest case regular expression might be the best answer. Unfortunately, in this case, it seems that we need to parse a subset of SQL language. While it is possible to solve this with regular expressions, they are not designed to parse complex languages (nested brackets and escaped strings).

It is also possible that the requirements will evolve over time and it will be required to parse more complex structures.

If company policy allows, I will chose to build internal DSL in order to parse this string.

One of my favorite tools to build internal DLSs is called Sprache

Below you can find an example parser using internal DSL approach.

In the code I've defined primitives to handle required SQL operators and composed final parser out of those.

    [Test]
    public void Test()
    {
        string fullString = "group = '2843360' and (team in ('TEAM1', 'TEAM2','TEAM3'))";


        var resultParser =
            from @group in OperatorEquals("group")
            from @and in OperatorEnd()
            from @team in Brackets(OperatorIn("team"))
            select new {@group, @team};
        var result = resultParser.Parse(fullString);
        Assert.That(result.group, Is.EqualTo("2843360"));
        Assert.That(result.team, Is.EquivalentTo(new[] {"TEAM1", "TEAM2", "TEAM3"}));
    }

    private static readonly Parser<char> CellSeparator =
        from space1 in Parse.WhiteSpace.Many()
        from s in Parse.Char(',')
        from space2 in Parse.WhiteSpace.Many()
        select s;

    private static readonly Parser<char> QuoteEscape = Parse.Char('\\');

    private static Parser<T> Escaped<T>(Parser<T> following)
    {
        return from escape in QuoteEscape
               from f in following
               select f;
    }

    private static readonly Parser<char> QuotedCellDelimiter = Parse.Char('\'');

    private static readonly Parser<char> QuotedCellContent =
        Parse.AnyChar.Except(QuotedCellDelimiter).Or(Escaped(QuotedCellDelimiter));

    private static readonly Parser<string> QuotedCell =
        from open in QuotedCellDelimiter
        from content in QuotedCellContent.Many().Text()
        from end in QuotedCellDelimiter
        select content;

    private static Parser<string> OperatorEquals(string column)
    {
        return
            from c in Parse.String(column)
            from space1 in Parse.WhiteSpace.Many()
            from opEquals in Parse.Char('=')
            from space2 in Parse.WhiteSpace.Many()
            from content in QuotedCell
            select content;
    }

    private static Parser<bool> OperatorEnd()
    {
        return
            from space1 in Parse.WhiteSpace.Many()
            from c in Parse.String("and")
            from space2 in Parse.WhiteSpace.Many()
            select true;
    }

    private static Parser<T> Brackets<T>(Parser<T> contentParser)
    {
        return from open in Parse.Char('(')
               from space1 in Parse.WhiteSpace.Many()
               from content in contentParser
               from space2 in Parse.WhiteSpace.Many()
               from close in Parse.Char(')')
               select content;
    }

    private static Parser<IEnumerable<string>> ComaSeparated()
    {
        return from leading in QuotedCell
               from rest in CellSeparator.Then(_ => QuotedCell).Many()
               select Cons(leading, rest);
    }

    private static Parser<IEnumerable<string>> OperatorIn(string column)
    {
        return
            from c in Parse.String(column)
            from space1 in Parse.WhiteSpace
            from opEquals in Parse.String("in")
            from space2 in Parse.WhiteSpace.Many()
            from content in Brackets(ComaSeparated())
            from space3 in Parse.WhiteSpace.Many()
            select content;
    }

    private static IEnumerable<T> Cons<T>(T head, IEnumerable<T> rest)
    {
        yield return head;
        foreach (T item in rest)
            yield return item;
    }

I managed to do that using regular expressions :

var str = "group = '2843360' and (team in ('TEAM1', 'TEAM2','TEAM3'))";

// Grabs the group ID
var group = Regex.Match(str, @"group = '(?<ID>\d+)'", RegexOptions.IgnoreCase)
    .Groups["ID"].Value;

// Grabs everything inside teams parentheses
var teams = Regex.Match(str, @"team in \((?<Teams>(\s*'[^']+'\s*,?)+)\)", RegexOptions.IgnoreCase)
    .Groups["Teams"].Value;

// Trim and remove single quotes
var teamsArray = teams.Split(new char[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
    .Select(s =>
        {
            var trimmed = s.Trim();
            return trimmed.Substring(1, trimmed.Length - 2);
        }).ToArray();

The result will be:

string[] { "TEAM1", "TEAM2", "TEAM3" }

There's probably a regex solution for this, but if the format is strict I try efficient string methods first. The following works with your input.

I'm using a custom class, TeamGroup , to encapsulate complexity and to hold all relevant properties in one object:

public class TeamGroup
{
    public string Group { get; set; }
    public string[] Teams { get; set; }

    public static TeamGroup ParseOut(string fullString)
    {
        TeamGroup tg = new TeamGroup{ Teams = new string[]{ } };
        int index = fullString.IndexOf("group = '");
        if (index >= 0)
        {
            index += "group = '".Length;
            int endIndex = fullString.IndexOf("'", index);
            if (endIndex >= 0)
            {
                tg.Group = fullString.Substring(index, endIndex - index).Trim(' ', '\'');
                endIndex += 1;
                index = fullString.IndexOf(" and (team in (", endIndex);
                if (index >= 0)
                {
                    index += " and (team in (".Length;
                    endIndex = fullString.IndexOf(")", index);
                    if (endIndex >= 0)
                    {
                        string allTeamsString = fullString.Substring(index, endIndex - index);
                        tg.Teams = allTeamsString.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
                            .Select(t => t.Trim(' ', '\''))
                            .ToArray();
                    }
                }
            }
        }
        return tg;
    }
}

You would use it in this way:

string fullString = "group = '2843360' and (team in ('TEAM1', 'TEAM2','TEAM3'))";
TeamGroup tg = TeamGroup.ParseOut(fullString);
Console.Write("Group: {0} Teams: {1}", tg.Group, string.Join(", ", tg.Teams));

Outputs:

Group: 2843360 Teams: TEAM1, TEAM2, TEAM3

I think you will need to look into a tokenization process in order to get the desired result and take into consideration the order of execution established by the parenthesis. You may be able to use the shunting-yard algorithm to assist with tokenization and execution order.

The advantage of the shunting-yard is that it allows you to define tokens that can be later used to property parse the string and execute the proper operation. While it normally applies to mathematical order of operations it can be adapted to fit your purpose.

Here is some information:

http://en.wikipedia.org/wiki/Shunting-yard_algorithm http://www.slideshare.net/grahamwell/shunting-yard

If fullString is not machine generated, you may need to add some error catching, but this will work out of the box, and give you a test to work against.

    public string ParseoutGroup(string fullString)
    {
        var matches = Regex.Matches(fullString, @"group\s?=\s?'([^']+)'", RegexOptions.IgnoreCase);
        return matches[0].Groups[1].Captures[0].Value;
    }

    public string[] ParseoutTeamNames(string fullString)
    {
        var teams = new List<string>();
        var matches = Regex.Matches(fullString, @"team\s?in\s?\((\s*'([^']+)',?\s*)+\)", RegexOptions.IgnoreCase);
        foreach (var capture in matches[0].Groups[2].Captures)
        {
            teams.Add(capture.ToString());
        }
        return teams.ToArray();
    }

    [Test]
    public void parser()
    {
        string test = "group = '2843360' and (team in ('team1', 'team2', 'team3'))";
        var group = ParseoutGroup(test);
        Assert.AreEqual("2843360",group);

        var teams = ParseoutTeamNames(test);
        Assert.AreEqual(3, teams.Count());
        Assert.AreEqual("team1", teams[0]);
        Assert.AreEqual("team2", teams[1]);
        Assert.AreEqual("team3", teams[2]);
    }

An addition to @BrunoLM's solution:

(Worth the extra lines if you'll have more variables to check later on):

You can split the string on the "and" keyword and have a function to check each clause against appropriate regex statement and return the desired value.

(Untested code, but it should deliver the idea.)

statments = statment.split('and')
//So now:
//statments[0] = "group = '2843360' "
//statments[1] = "(team in ('TEAM1', 'TEAM2','TEAM3'))"
foreach s in statments {
    if (s.contains('group') group = RegexFunctionToExtract_GroupValue(s) ;
    if (s.contains('team') teams = RegexFunctionToExtract_TeamValue(s) ;
}

I believe that this approach will deliver cleaner, easy-to-maintain code and a slight optimization.

Of course this approach doesn't expect an "OR" clause. However, it can be done with a little more tweaking.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM