
Splitting a string on / when not within [ ]

I'm trying to split a string representing an XPath such as:

string myPath = "/myns:Node1/myns:Node2[./myns:Node3=123456]/myns:Node4";

I need to split on '/' (with the '/' excluded from the results, as with a normal string split) unless the '/' is inside '[ ... ]', in which case it should not be split on and should stay in the result.

So what a normal string[] result = myPath.Split("/".ToCharArray()) gets me:

result[0]: //Empty string, this is ok
result[1]: myns:Node1
result[2]: myns:Node2[.
result[3]: myns:Node3=123456]
result[4]: myns:Node4

result[2] and result[3] should essentially be combined, and I should end up with:

result[0]: //Empty string, this is ok
result[1]: myns:Node1
result[2]: myns:Node2[./myns:Node3=123456]
result[3]: myns:Node4

Since I'm not super fluent in regex, I've tried manually recombining the results into a new array after the split, but what concerns me is that while it's trivial to get it to work for this example, regex seems the better option in the case where I get more complex xpaths.

For the record, I have looked at the following questions:
Regex split string preserving quotes
C# Regex Split - commas outside quotes
Split a string that has white spaces, unless they are enclosed within "quotes"?

While they should be sufficient to help me with my problem, I'm running into a few issues and confusing aspects that prevent me from using them.
In the first 2 links, as a newbie to regex I'm finding them hard to interpret and learn from. They are looking for quotes, which look identical between left and right pairs, so translating it to [ and ] is confusing me, and trial and error is not teaching me anything, rather, it's just frustrating me more. I can understand fairly basic regex, but what these answers do is a little more than what I currently understand, even with the explanation in the first link.
In the third link, I won't have access to LINQ as the code will be used in an older version of .NET.

XPath is a complex language, and splitting an XPath expression on top-level slashes fails in many situations. Examples:

/myns:Node1/myns:Node2[./myns:Node3=123456]/myns:Node4
string(/myns:Node1/myns:Node2)

I suggest another approach to cover more cases. Instead of trying to split, try to match each part between slashes with the Regex.Matches(String, String) method. The advantage of this approach is that you can freely describe what these parts look like:

string pattern = @"(?xs)
    [^][/()]+ # all that isn't a slash or a bracket
    (?: # predicates (possibly nested)
        \[ 
        (?: [^]['""] | (?<c>\[) | (?<-c>] )
          | "" (?> [^""\\]* (?: \\. [^""\\]* )* ) "" # quoted parts
          | '  (?> [^'\\]*  (?: \\. [^'\\]*  )* ) '
        )*?
        (?(c)(?!$)) # check if brackets are balanced
        ]
      |  # same thing for round brackets
        \(
        (?: [^()'""] | (?<d>\() | (?<-d>\) )
          | "" (?> [^""\\]* (?: \\. [^""\\]* )* ) ""
          | '  (?> [^'\\]*  (?: \\. [^'\\]*  )* ) '
        )*?
        (?(d)(?!$))
        \)
    )*
  |
    (?<![^/])(?![^/]) # empty string between slashes, at the start or end
";

Note: to make sure that the whole string has been parsed, you can append something like |\z(?<=(.)) to the pattern. You can then test whether the capture group was set to know that you have reached the end of the string. (You can also compare the position of the last match plus its length with the length of the string.)
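As a rough illustration of how the matches might be collected, here is a minimal sketch. The class name XPathMatchDemo and the helper GetParts are hypothetical names introduced only for this example, and the pattern is repeated so the block is self-contained. It avoids LINQ because the question targets an older .NET version.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class XPathMatchDemo
{
    // Same pattern as above, repeated so this sketch compiles on its own.
    private const string Pattern = @"(?xs)
    [^][/()]+ # all that isn't a slash or a bracket
    (?: # predicates (possibly nested)
        \[
        (?: [^]['""] | (?<c>\[) | (?<-c>] )
          | "" (?> [^""\\]* (?: \\. [^""\\]* )* ) "" # quoted parts
          | '  (?> [^'\\]*  (?: \\. [^'\\]*  )* ) '
        )*?
        (?(c)(?!$)) # check if brackets are balanced
        ]
      |  # same thing for round brackets
        \(
        (?: [^()'""] | (?<d>\() | (?<-d>\) )
          | "" (?> [^""\\]* (?: \\. [^""\\]* )* ) ""
          | '  (?> [^'\\]*  (?: \\. [^'\\]*  )* ) '
        )*?
        (?(d)(?!$))
        \)
    )*
  |
    (?<![^/])(?![^/]) # empty string between slashes, at the start or end
";

    // Hypothetical helper: returns the value of every match, in order.
    public static string[] GetParts(string path)
    {
        List<string> parts = new List<string>();
        foreach (Match m in Regex.Matches(path, Pattern))
        {
            parts.Add(m.Value);
        }
        return parts.ToArray();
    }
}
```

For the path from the question, GetParts should return the empty string followed by myns:Node1, myns:Node2[./myns:Node3=123456] and myns:Node4. Only that path was traced here, so treat behaviour on other expressions as something to check yourself.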


If a regex as complex as the one Casimir et Hippolyte suggests is required, then perhaps regex is not the best option in this circumstance. As a non-regex alternative, here is what parsing the XPath string manually might look like:

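// Needs: using System; using System.Collections.Generic;
// Splits input on splitChar, ignoring any splitChar found between groupStart and groupEnd.
public string[] Split(string input, char splitChar, char groupStart, char groupEnd)
{
    List<string> splits = new List<string>();

    int startIdx = 0; // start of the segment currently being read
    int groupNo = 0;  // nesting depth of groupStart/groupEnd at the current position

    for (int i = 0; i < input.Length; i++)
    {
        if (input[i] == splitChar && groupNo == 0)
        {
            splits.Add(input.Substring(startIdx, i - startIdx));
            startIdx = i + 1;
        }
        else if (input[i] == groupStart)
        {
            groupNo++;
        }
        else if (input[i] == groupEnd)
        {
            // Never go below zero on a stray closing character.
            groupNo = Math.Max(groupNo - 1, 0);
        }
    }

    splits.Add(input.Substring(startIdx));

    // Drop empty segments (such as the one before a leading '/').
    // List<T>.RemoveAll is used instead of LINQ, which the question cannot rely on.
    splits.RemoveAll(delegate(string s) { return string.IsNullOrEmpty(s); });
    return splits.ToArray();
}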

Personally, I think this is much easier to both understand and implement. To use it, you can do the following:

var input = "/myns:Node1/myns:Node2[./myns:Node3=123456]/myns:Node4[text()='some[] brackets']";
var split = Split(input, '/', '[', ']');

This will output the following:

split[0] = "myns:Node1"
split[1] = "myns:Node2[./myns:Node3=123456]"
split[2] = "myns:Node4[text()='some[] brackets']"
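The example above works because the brackets inside the quoted text happen to be balanced. As a sketch only, a variant that ignores everything inside single or double quotes might look like the following. The class name QuoteAwareSplitter is hypothetical, and the quote handling is an assumption added for illustration, not part of the method above.

```csharp
using System;
using System.Collections.Generic;

public static class QuoteAwareSplitter
{
    // Hypothetical variant of Split: characters inside '...' or "..." never
    // change the bracket depth and never act as separators.
    public static string[] Split(string input, char splitChar, char groupStart, char groupEnd)
    {
        List<string> splits = new List<string>();
        int startIdx = 0;  // start of the segment currently being read
        int depth = 0;     // bracket nesting depth outside quotes
        char quote = '\0'; // '\0' means "not inside a quoted literal"

        for (int i = 0; i < input.Length; i++)
        {
            char ch = input[i];
            if (quote != '\0')
            {
                // Inside a literal: only the matching quote character ends it.
                if (ch == quote) quote = '\0';
            }
            else if (ch == '\'' || ch == '"')
            {
                quote = ch;
            }
            else if (ch == splitChar && depth == 0)
            {
                // Skip empty segments, as the original method does.
                if (i > startIdx) splits.Add(input.Substring(startIdx, i - startIdx));
                startIdx = i + 1;
            }
            else if (ch == groupStart)
            {
                depth++;
            }
            else if (ch == groupEnd && depth > 0)
            {
                depth--;
            }
        }

        if (startIdx < input.Length) splits.Add(input.Substring(startIdx));
        return splits.ToArray();
    }
}
```

With this variant, an input such as /a[b='x]/y']/c is split into a[b='x]/y'] and c, whereas a splitter that only counts brackets would treat the ] inside the quotes as the end of the predicate.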

The second link you posted is actually perfect for your needs. All it needs is some tweaking to detect brackets instead of apostrophes:

\/(?=(?:[^[]*\[[^\]]*])*[^]]*$)

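Basically, it only matches a forward slash when the lookahead can read the rest of the string as zero or more [ ... ] groups followed by text that contains no right square bracket. A slash inside the predicate of the example is followed by a ] with no [ before it, so it is skipped. You can use it like so: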

string[] matches = Regex.Split(myPath, "\\/(?=(?:[^[]*\\[[^\\]]*])*[^]]*$)");
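To try the lookahead split in isolation, a minimal sketch could look like this. The class name LookaheadSplitDemo is hypothetical, and the pattern is the same one written as a verbatim string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadSplitDemo
{
    // Verbatim-string form of the lookahead pattern, so backslashes are not doubled.
    private const string Pattern = @"\/(?=(?:[^[]*\[[^\]]*])*[^]]*$)";

    // Splits on every slash accepted by the lookahead.
    public static string[] Split(string path)
    {
        return Regex.Split(path, Pattern);
    }
}
```

For the path in the question this gives an empty string, myns:Node1, myns:Node2[./myns:Node3=123456] and myns:Node4. One caveat worth checking before relying on it: with more than one predicate, as in /a[./b]/c[./d], the [^[]* part can run across the ] that closes the first predicate, so the slash inside that predicate is still split on.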

\/(?![^\[]*\])

Try this. See the demo:

https://regex101.com/r/uLcWux/1

Use it in a verbatim string (prefixed with @), or escape the backslashes in a regular string: \\/(?![^\\[]*\\])

P.S. This only works for simple XPaths that have no nested square brackets and no [ or ] inside quotes.
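A minimal sketch of using the negative-lookahead pattern with Regex.Split follows. The class name NegativeLookaheadSplitDemo is a hypothetical name used only here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NegativeLookaheadSplitDemo
{
    // Matches a slash unless a ']' follows it with no '[' in between.
    private const string Pattern = @"\/(?![^\[]*\])";

    // Splits on every slash that is not rejected by the negative lookahead.
    public static string[] Split(string path)
    {
        return Regex.Split(path, Pattern);
    }
}
```

Because the pattern has no capturing groups, Regex.Split returns only the pieces between the matched slashes. For the path in the question that is an empty string, myns:Node1, myns:Node2[./myns:Node3=123456] and myns:Node4, and a path with two separate predicates such as /a[./b]/c[./d] comes back as an empty string, a[./b] and c[./d].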
