
What is the best way to parse out this string inside a string?

I have the following string:

 string fullString = "group = '2843360' and (team in ('TEAM1', 'TEAM2','TEAM3'))";

I want to parse the following values out of this string:

 string group = ParseoutGroup(fullString);  // Expect "2843360"
 string[] teams = ParseoutTeamNames(fullString); // Expect array with three items

In the example above, the full string could list one or many teams (not always three).

I have this partially working, but my code feels hacky and not very future-proof. Is there a better regular expression solution, or a more elegant way to parse these values out of the full string? Other things could be added to the string later, so I want this to be as foolproof as possible.

In the simplest case, a regular expression might be the best answer. Unfortunately, in this case it seems that we need to parse a subset of the SQL language. While it is possible to solve this with regular expressions, they are not designed to parse complex languages (nested brackets and escaped strings).

It is also possible that the requirements will evolve over time and that more complex structures will need to be parsed.

If company policy allows, I would choose to build an internal DSL to parse this string.

One of my favorite tools for building internal DSLs is Sprache.

Below is an example parser that uses the internal DSL approach.

In the code I've defined primitives to handle the required SQL operators and composed the final parser out of those.

    // Requires: using System.Collections.Generic; using NUnit.Framework; using Sprache;
    [Test]
    public void Test()
    {
        string fullString = "group = '2843360' and (team in ('TEAM1', 'TEAM2','TEAM3'))";

        var resultParser =
            from @group in OperatorEquals("group")
            from @and in OperatorEnd()
            from @team in Brackets(OperatorIn("team"))
            select new {@group, @team};
        var result = resultParser.Parse(fullString);
        Assert.That(result.group, Is.EqualTo("2843360"));
        Assert.That(result.team, Is.EquivalentTo(new[] {"TEAM1", "TEAM2", "TEAM3"}));
    }

    private static readonly Parser<char> CellSeparator =
        from space1 in Parse.WhiteSpace.Many()
        from s in Parse.Char(',')
        from space2 in Parse.WhiteSpace.Many()
        select s;

    private static readonly Parser<char> QuoteEscape = Parse.Char('\\');

    private static Parser<T> Escaped<T>(Parser<T> following)
    {
        return from escape in QuoteEscape
               from f in following
               select f;
    }

    private static readonly Parser<char> QuotedCellDelimiter = Parse.Char('\'');

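    // Try the escaped delimiter first; otherwise the backslash is consumed as ordinary
    // content and the quote that follows it ends the cell early.
    private static readonly Parser<char> QuotedCellContent =
        Escaped(QuotedCellDelimiter).Or(Parse.AnyChar.Except(QuotedCellDelimiter));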

    private static readonly Parser<string> QuotedCell =
        from open in QuotedCellDelimiter
        from content in QuotedCellContent.Many().Text()
        from end in QuotedCellDelimiter
        select content;

    private static Parser<string> OperatorEquals(string column)
    {
        return
            from c in Parse.String(column)
            from space1 in Parse.WhiteSpace.Many()
            from opEquals in Parse.Char('=')
            from space2 in Parse.WhiteSpace.Many()
            from content in QuotedCell
            select content;
    }

    private static Parser<bool> OperatorEnd()
    {
        return
            from space1 in Parse.WhiteSpace.Many()
            from c in Parse.String("and")
            from space2 in Parse.WhiteSpace.Many()
            select true;
    }

    private static Parser<T> Brackets<T>(Parser<T> contentParser)
    {
        return from open in Parse.Char('(')
               from space1 in Parse.WhiteSpace.Many()
               from content in contentParser
               from space2 in Parse.WhiteSpace.Many()
               from close in Parse.Char(')')
               select content;
    }

    private static Parser<IEnumerable<string>> ComaSeparated()
    {
        return from leading in QuotedCell
               from rest in CellSeparator.Then(_ => QuotedCell).Many()
               select Cons(leading, rest);
    }

    private static Parser<IEnumerable<string>> OperatorIn(string column)
    {
        return
            from c in Parse.String(column)
            from space1 in Parse.WhiteSpace.AtLeastOnce()
            from opIn in Parse.String("in")
            from space2 in Parse.WhiteSpace.Many()
            from content in Brackets(ComaSeparated())
            from space3 in Parse.WhiteSpace.Many()
            select content;
    }

    private static IEnumerable<T> Cons<T>(T head, IEnumerable<T> rest)
    {
        yield return head;
        foreach (T item in rest)
            yield return item;
    }

I managed to do that using regular expressions:

// Requires: using System; using System.Linq; using System.Text.RegularExpressions;
var str = "group = '2843360' and (team in ('TEAM1', 'TEAM2','TEAM3'))";

// Grabs the group ID
var group = Regex.Match(str, @"group = '(?<ID>\d+)'", RegexOptions.IgnoreCase)
    .Groups["ID"].Value;

// Grabs everything inside teams parentheses
var teams = Regex.Match(str, @"team in \((?<Teams>(\s*'[^']+'\s*,?)+)\)", RegexOptions.IgnoreCase)
    .Groups["Teams"].Value;

// Trim and remove single quotes
var teamsArray = teams.Split(new char[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
    .Select(s =>
        {
            var trimmed = s.Trim();
            return trimmed.Substring(1, trimmed.Length - 2);
        }).ToArray();

The result will be:

string[] { "TEAM1", "TEAM2", "TEAM3" }

There's probably a regex solution for this, but if the format is strict I try efficient string methods first. The following works with your input.

I'm using a custom class, TeamGroup, to encapsulate the complexity and to hold all relevant properties in one object:

using System;
using System.Linq;

public class TeamGroup
{
    public string Group { get; set; }
    public string[] Teams { get; set; }

    public static TeamGroup ParseOut(string fullString)
    {
        TeamGroup tg = new TeamGroup{ Teams = new string[]{ } };
        int index = fullString.IndexOf("group = '");
        if (index >= 0)
        {
            index += "group = '".Length;
            int endIndex = fullString.IndexOf("'", index);
            if (endIndex >= 0)
            {
                tg.Group = fullString.Substring(index, endIndex - index).Trim(' ', '\'');
                endIndex += 1;
                index = fullString.IndexOf(" and (team in (", endIndex);
                if (index >= 0)
                {
                    index += " and (team in (".Length;
                    endIndex = fullString.IndexOf(")", index);
                    if (endIndex >= 0)
                    {
                        string allTeamsString = fullString.Substring(index, endIndex - index);
                        tg.Teams = allTeamsString.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
                            .Select(t => t.Trim(' ', '\''))
                            .ToArray();
                    }
                }
            }
        }
        return tg;
    }
}

You would use it like this:

string fullString = "group = '2843360' and (team in ('TEAM1', 'TEAM2','TEAM3'))";
TeamGroup tg = TeamGroup.ParseOut(fullString);
Console.Write("Group: {0} Teams: {1}", tg.Group, string.Join(", ", tg.Teams));

Output:

Group: 2843360 Teams: TEAM1, TEAM2, TEAM3

I think you will need to look into a tokenization process to get the desired result, taking into account the order of evaluation established by the parentheses. You may be able to use the shunting-yard algorithm to assist with tokenization and execution order.

The advantage of the shunting-yard algorithm is that it lets you define tokens that can later be used to properly parse the string and execute the proper operation. While it normally applies to the mathematical order of operations, it can be adapted to fit your purpose.

Here is some information:

http://en.wikipedia.org/wiki/Shunting-yard_algorithm
http://www.slideshare.net/grahamwell/shunting-yard
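As a rough illustration of the tokenization step only (not the shunting-yard stage itself), here is a minimal sketch. The names TokenKind, Token, and FilterTokenizer are hypothetical and not part of any library, and the sketch assumes quoted values never contain an escaped quote.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical token kinds for the filter string; not part of any library.
public enum TokenKind { Identifier, StringLiteral, Symbol }

public sealed class Token
{
    public TokenKind Kind { get; }
    public string Text { get; }
    public Token(TokenKind kind, string text) { Kind = kind; Text = text; }
}

public static class FilterTokenizer
{
    // Splits the input into identifiers/keywords, quoted literals, and the symbols ( ) , =
    public static List<Token> Tokenize(string input)
    {
        var tokens = new List<Token>();
        int i = 0;
        while (i < input.Length)
        {
            char c = input[i];
            if (char.IsWhiteSpace(c))
            {
                i++; // whitespace only separates tokens
            }
            else if (c == '\'')
            {
                // Quoted literal: read up to the next quote (assumes no escaped quotes).
                int end = input.IndexOf('\'', i + 1);
                if (end < 0)
                    throw new FormatException("Unterminated string literal at position " + i);
                tokens.Add(new Token(TokenKind.StringLiteral, input.Substring(i + 1, end - i - 1)));
                i = end + 1;
            }
            else if (char.IsLetter(c))
            {
                // Identifier or keyword such as group, team, and, in.
                int start = i;
                while (i < input.Length && char.IsLetterOrDigit(input[i])) i++;
                tokens.Add(new Token(TokenKind.Identifier, input.Substring(start, i - start)));
            }
            else if ("(),=".IndexOf(c) >= 0)
            {
                tokens.Add(new Token(TokenKind.Symbol, c.ToString()));
                i++;
            }
            else
            {
                throw new FormatException("Unexpected character '" + c + "' at position " + i);
            }
        }
        return tokens;
    }
}
```

A later stage could walk this token list and use the parentheses and the and/in keywords to decide how the clauses nest, which is where a shunting-yard style pass would come in.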

If fullString is not machine generated, you may need to add some error handling, but this works out of the box and gives you a test to work against.

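    // Requires: using System.Collections.Generic; using System.Linq;
    // using System.Text.RegularExpressions; using NUnit.Framework;
    public string ParseoutGroup(string fullString)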
    {
        var matches = Regex.Matches(fullString, @"group\s?=\s?'([^']+)'", RegexOptions.IgnoreCase);
        return matches[0].Groups[1].Captures[0].Value;
    }

    public string[] ParseoutTeamNames(string fullString)
    {
        var teams = new List<string>();
        var matches = Regex.Matches(fullString, @"team\s?in\s?\((\s*'([^']+)',?\s*)+\)", RegexOptions.IgnoreCase);
        foreach (var capture in matches[0].Groups[2].Captures)
        {
            teams.Add(capture.ToString());
        }
        return teams.ToArray();
    }

    [Test]
    public void parser()
    {
        string test = "group = '2843360' and (team in ('team1', 'team2', 'team3'))";
        var group = ParseoutGroup(test);
        Assert.AreEqual("2843360",group);

        var teams = ParseoutTeamNames(test);
        Assert.AreEqual(3, teams.Count());
        Assert.AreEqual("team1", teams[0]);
        Assert.AreEqual("team2", teams[1]);
        Assert.AreEqual("team3", teams[2]);
    }
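If the input may be hand-written, the error handling mentioned above could look something like the following sketch. SafeFilterParser, TryParseGroup, and ParseTeamNamesOrEmpty are hypothetical names chosen for illustration; the patterns are a slightly more whitespace-tolerant variant of the ones above, and the code returns a failure value instead of throwing when a clause is missing.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SafeFilterParser
{
    // Returns false (and a null group) instead of throwing when no group clause is found.
    public static bool TryParseGroup(string fullString, out string group)
    {
        Match m = Regex.Match(fullString ?? "", @"group\s*=\s*'([^']+)'", RegexOptions.IgnoreCase);
        group = m.Success ? m.Groups[1].Value : null;
        return m.Success;
    }

    // Returns an empty array when no team clause is found.
    public static string[] ParseTeamNamesOrEmpty(string fullString)
    {
        Match m = Regex.Match(
            fullString ?? "",
            @"team\s+in\s*\((?:\s*'([^']+)'\s*,?)+\)",
            RegexOptions.IgnoreCase);
        if (!m.Success) return new string[0];

        // Group 1 is captured once per quoted team name.
        return m.Groups[1].Captures.Cast<Capture>().Select(c => c.Value).ToArray();
    }
}
```

Callers can then branch on the boolean result or on an empty array rather than catching an index-out-of-range exception from matches[0].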

An addition to @BrunoLM's solution:

(Worth the extra lines if you'll have more variables to check later on.)

You can split the string on the "and" keyword and have a function check each clause against the appropriate regex and return the desired value.

(Untested code, but it should convey the idea.)

// Split the filter into clauses on the "and" keyword.
string[] statements = statement.Split(new[] { " and " }, StringSplitOptions.None);
// So now:
// statements[0] = "group = '2843360'"
// statements[1] = "(team in ('TEAM1', 'TEAM2','TEAM3'))"
foreach (string s in statements)
{
    if (s.Contains("group")) group = RegexFunctionToExtract_GroupValue(s);
    if (s.Contains("team")) teams = RegexFunctionToExtract_TeamValue(s);
}

I believe this approach will deliver cleaner, easier-to-maintain code and a slight optimization.

Of course, this approach doesn't handle an "OR" clause. However, that can be added with a little more tweaking.
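For reference, a self-contained sketch of the split-on-"and" idea might look like this. ClauseParser is a hypothetical name, and instead of one regex function per column it uses a single generic extractor that takes the first word of each clause as the column name. It assumes that no quoted value contains " and " or an escaped quote, and it still ignores "OR".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ClauseParser
{
    // Splits the filter on the "and" keyword and returns the quoted values of each
    // clause, keyed by the clause's column name (case-insensitive).
    public static Dictionary<string, string[]> Parse(string statement)
    {
        var result = new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase);
        string[] clauses = Regex.Split(statement, @"\s+and\s+", RegexOptions.IgnoreCase);

        foreach (string clause in clauses)
        {
            // The first identifier in the clause is treated as the column name.
            Match column = Regex.Match(clause, @"[A-Za-z_]\w*");
            if (!column.Success) continue;

            // Every single-quoted literal in the clause is one value.
            string[] values = Regex.Matches(clause, @"'([^']*)'")
                .Cast<Match>()
                .Select(m => m.Groups[1].Value)
                .ToArray();

            result[column.Value] = values;
        }
        return result;
    }
}
```

With the example string, Parse(fullString)["group"] holds the single group ID and Parse(fullString)["team"] holds the team names, so a new clause added later needs no extra code as long as it follows the same column-then-quoted-values shape.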
