
Find all substrings between two strings

I need to get all substrings of a string that lie between two given delimiter strings.
For example:

StringParser.GetSubstrings("[start]aaaaaa[end] wwwww [start]cccccc[end]", "[start]", "[end]");

This should return two strings, "aaaaaa" and "cccccc". Assume there is only one level of nesting. I'm not sure about regular expressions, but I think they would be useful here.

private IEnumerable<string> GetSubStrings(string input, string start, string end)
{
    Regex r = new Regex(Regex.Escape(start) + "(.*?)" + Regex.Escape(end));
    MatchCollection matches = r.Matches(input);
    foreach (Match match in matches)
        yield return match.Groups[1].Value;
}

Here's a solution that doesn't use regular expressions and doesn't take nesting into consideration.

public static IEnumerable<string> EnclosedStrings(
    this string s, 
    string begin, 
    string end)
{
    int beginPos = s.IndexOf(begin, 0);
    while (beginPos >= 0)
    {
        int start = beginPos + begin.Length;
        int stop = s.IndexOf(end, start);
        if (stop < 0)
            yield break;
        yield return s.Substring(start, stop - start);
        beginPos = s.IndexOf(begin, stop+end.Length);
    }           
}
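To make "doesn't take nesting into consideration" concrete, here is a minimal sketch that runs the extension method on a flat input and on a nested one. The class names StringExtensions and EnclosedStringsDemo and the nested sample string are assumptions made for illustration; the method body is the one shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StringExtensions
{
    // Same logic as the method above: pair each begin marker with the
    // nearest following end marker, ignoring any nesting.
    public static IEnumerable<string> EnclosedStrings(
        this string s, string begin, string end)
    {
        int beginPos = s.IndexOf(begin, 0);
        while (beginPos >= 0)
        {
            int start = beginPos + begin.Length;
            int stop = s.IndexOf(end, start);
            if (stop < 0)
                yield break;
            yield return s.Substring(start, stop - start);
            beginPos = s.IndexOf(begin, stop + end.Length);
        }
    }
}

public static class EnclosedStringsDemo
{
    public static void Main()
    {
        // Flat input from the question: yields "aaaaaa" and "cccccc".
        string flat = "[start]aaaaaa[end] wwwww [start]cccccc[end]";
        foreach (string item in flat.EnclosedStrings("[start]", "[end]"))
            Console.WriteLine(item);

        // Nested input (made up): the outer begin marker is paired with the
        // first end marker, so the single result is "a[start]b".
        string nested = "[start]a[start]b[end]c[end]";
        foreach (string item in nested.EnclosedStrings("[start]", "[end]"))
            Console.WriteLine(item);
    }
}
```

If nested markers need to be matched as pairs, a different approach (for example, tracking depth with a counter) would be required.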

You're going to need to better define the rules that govern your matching needs. When building any kind of matching or search code, you need to be very clear about what inputs you anticipate and what outputs you need to produce. It's very easy to produce buggy code if you don't consider these questions closely. That said...

You should be able to use regular expressions. Nesting may make it slightly more complicated but still doable (depending on what you expect to match in nested scenarios). Something like this should get you started:

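var start = "[start]";
var end = "[end]";
// Lazy (.*?) stops at the first end marker; a greedy (.*) would run on to the last one.
var regEx = new Regex(String.Format("{0}(.*?){1}", Regex.Escape(start), Regex.Escape(end)));
var source = "[start]aaaaaa[end] wwwww [start]cccccc[end]";
// Matches returns every occurrence; Match would return only the first.
var matches = regEx.Matches( source );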

It should be trivial to wrap the code above into a function appropriate for your needs.

You can use a regular expression, but remember to call Regex.Escape on your arguments:

public static IEnumerable<string> GetSubStrings(
   string text,
   string start,
   string end)
{
    string regex = string.Format("{0}(.*?){1}",
        Regex.Escape(start), 
        Regex.Escape(end));

    return Regex.Matches(text, regex, RegexOptions.Singleline)
        .Cast<Match>()
        .Select(match => match.Groups[1].Value);
}

I also added the RegexOptions.Singleline option so that it will match even if there are newlines in your text.
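As a rough illustration of what RegexOptions.Singleline changes, the sketch below runs the same pattern with and without the option on text that contains a newline between the markers. The class name SinglelineDemo, the Extract helper, and the sample text are hypothetical and exist only for this example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Hypothetical helper: returns the captured text of every start...end pair.
    public static string[] Extract(
        string text, string start, string end, RegexOptions options)
    {
        string pattern = Regex.Escape(start) + "(.*?)" + Regex.Escape(end);
        return Regex.Matches(text, pattern, options)
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .ToArray();
    }

    public static void Main()
    {
        string text = "[start]line one\nline two[end] [start]cccccc[end]";

        // Without Singleline, '.' does not match '\n', so the first pair is
        // skipped and only "cccccc" is found.
        string[] withoutOption = Extract(text, "[start]", "[end]", RegexOptions.None);

        // With Singleline, '.' matches every character, including '\n',
        // so both pairs are found.
        string[] withOption = Extract(text, "[start]", "[end]", RegexOptions.Singleline);

        Console.WriteLine(withoutOption.Length);
        Console.WriteLine(withOption.Length);
    }
}
```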

I was bored, so I wrote a useless micro-benchmark which "proves" (on my dataset, which has strings of up to 7k characters and <b> tags as the start/end parameters) my suspicion that juharr's solution is the fastest of the three overall.

Results (1000000 iterations * 20 test cases):

juharr: 6371ms
Jake: 6825ms
Mark Byers: 82063ms

NOTE: Compiled regex didn't speed things up much on my dataset.
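The benchmark code itself was not posted. For anyone who wants to run a similar comparison on their own data, a minimal Stopwatch-based sketch might look like the following. Everything here is an assumption: the class name MicroBenchmark, the generated input, the iteration count, and the use of StringComparison.Ordinal are illustrative choices, not the setup behind the numbers above.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

public static class MicroBenchmark
{
    // IndexOf-based extraction, following the regex-free approach above
    // (ordinal comparison is an assumption made for this sketch).
    public static List<string> WithIndexOf(string s, string begin, string end)
    {
        var result = new List<string>();
        int beginPos = s.IndexOf(begin, StringComparison.Ordinal);
        while (beginPos >= 0)
        {
            int start = beginPos + begin.Length;
            int stop = s.IndexOf(end, start, StringComparison.Ordinal);
            if (stop < 0)
                break;
            result.Add(s.Substring(start, stop - start));
            beginPos = s.IndexOf(begin, stop + end.Length, StringComparison.Ordinal);
        }
        return result;
    }

    // Regex-based extraction using a pre-built Regex instance.
    public static List<string> WithRegex(Regex regex, string s)
    {
        return regex.Matches(s)
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .ToList();
    }

    public static void Main()
    {
        // Made-up input: 200 repetitions of a short tagged fragment.
        string input = string.Concat(Enumerable.Repeat("<b>aaaaaa</b> wwwww ", 200));
        var regex = new Regex(
            Regex.Escape("<b>") + "(.*?)" + Regex.Escape("</b>"),
            RegexOptions.Singleline);
        const int iterations = 10000; // arbitrary; tune for your machine

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            WithIndexOf(input, "<b>", "</b>");
        sw.Stop();
        Console.WriteLine("IndexOf: " + sw.ElapsedMilliseconds + "ms");

        sw.Restart();
        for (int i = 0; i < iterations; i++)
            WithRegex(regex, input);
        sw.Stop();
        Console.WriteLine("Regex: " + sw.ElapsedMilliseconds + "ms");
    }
}
```

Timings from a loop like this depend heavily on the input and the runtime, so treat them as indicative only.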

Regex-free method:

public static List<string> extract_strings(string src, string start, string end)
{
    List<string> list = new List<string>();
    // Split on the start marker. Segment 0 is the text before the first
    // marker (or the whole string if there is no marker), so skip it.
    string[] segments = src.Split(new[] { start }, StringSplitOptions.None);
    for (int i = 1; i < segments.Length; i++)
    {
        // Keep only the part of the segment that precedes the end marker.
        int endPos = segments[i].IndexOf(end);
        if (endPos >= 0)
        {
            list.Add(segments[i].Substring(0, endPos));
        }
    }
    return list;
}
