Forum tags. What is the best way to implement them?

Question

I am building a forum and I want to use forum-style tags to let users format their posts in a limited fashion. Currently I am doing this with a regex, following this question: How to use C# regular expressions to emulate forum tags

The problem is that the regex does not handle nested tags. Here is a sample of how I implemented this method:

    public static string MyExtensionMethod(this string text)
    {
        return TransformTags(text);
    }

    private static string TransformTags(string input)
    {
        string regex = @"\[([^=]+)[=\x22']*(\S*?)['\x22]*\](.+?)\[/(\1)\]";
        MatchCollection matches = new Regex(regex).Matches(input);
        for (int i = 0; i < matches.Count; i++)
        {
            var tag = matches[i].Groups[1].Value;
            var optionalValue = matches[i].Groups[2].Value;
            var content = matches[i].Groups[3].Value;

            if (Regex.IsMatch(content, regex))
            {
                content = TransformTags(content);
            }

            content = HandleTags(content, optionalValue, tag);

            input = input.Replace(matches[i].Groups[0].Value, content);
        }

        return input;
    }

    private static string HandleTags(string content, string optionalValue, string tag)
    {
        switch (tag.ToLower())
        {
            case "quote":
                return string.Format("<div class='quote'>{0}</div>", content);
            default:
                return string.Empty;
        }
    }

Now, if I submit something like [quote] This user posted [quote] blah [/quote] [/quote], it does not properly detect the nested quote. Instead, it pairs the first opening quote tag with the first closing quote tag.

Are there any recommended solutions? Can the regex be modified to grab nested tags? Maybe I shouldn't use regex for this?

Answer 1

While doing this with regex alone is probably possible using balancing groups, it's pretty heavy voodoo, and it's intrinsically "fragile". What I propose is using regexes to find the open/close tags (without trying to associate each close with its open), collecting them in a collection (probably a stack), and then parsing them "by hand" (with a foreach). This way you get the best of both worlds: the regex finds the tags, and you handle them (and the wrongly written ones) by hand.

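using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

class TagMatch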
{
    public string Tag { get; set; }
    public Capture Capture { get; set; }
    public readonly List<string> Substrings = new List<string>();
}

static void Main(string[] args)
{
    var rx = new Regex(@"(?<OPEN>\[[A-Za-z]+?\])|(?<CLOSE>\[/[A-Za-z]+?\])|(?<TEXT>[^\[]+|\[)");
    var str = "Lorem [AA]ipsum [BB]dolor sit [/BB]amet, [ consectetur ][/AA]adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat. Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint obcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.";
    var matches = rx.Matches(str);

    // The bottom of the stack is a root element (empty tag) that collects the top-level output.
    var recurse = new Stack<TagMatch>();
    recurse.Push(new TagMatch { Tag = String.Empty });

    foreach (Match match in matches)
    {
        var text = match.Groups["TEXT"];

        TagMatch last;

        if (text.Success)
        {
            last = recurse.Peek();
            last.Substrings.Add(text.Value);
            continue;
        }

        var open = match.Groups["OPEN"];

        string tag;

        if (open.Success)
        {
            tag = open.Value.Substring(1, open.Value.Length - 2);
            recurse.Push(new TagMatch { Tag = tag, Capture = open.Captures[0] });
            continue;
        }

        var close = match.Groups["CLOSE"];

        tag = close.Value.Substring(2, close.Value.Length - 3);

        last = recurse.Peek();

        if (last.Tag == tag)
        {
            recurse.Pop();

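            // Fold the closed tag into its parent. The "**" markers are placeholders
            // for whatever opening/closing markup you actually want to emit.
            var lastLast = recurse.Peek();
            lastLast.Substrings.Add("**" + last.Tag + "**");
            lastLast.Substrings.AddRange(last.Substrings);
            lastLast.Substrings.Add("**/" + last.Tag + "**");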
        }
        else
        {
            throw new Exception();
        }
    }

    if (recurse.Count != 1)
    {
        throw new Exception();
    }

    var sb = new StringBuilder();
    foreach (var str2 in recurse.Pop().Substrings)
    {
        sb.Append(str2);
    }

    var str3 = sb.ToString();
}

This is only an example. It's case sensitive (but that is easy to fix). It doesn't handle "unpaired" tags, because there are various ways to handle them: wherever you find a "throw new Exception", you'll have to add your own handling. Clearly this isn't a "drop-in" solution, so I won't respond to questions like "the compiler tells me I need a namespace" or "the compiler can't find Regex". BUT I will be more than happy to respond to "advanced" questions, like how unpaired tags could be matched, or how you could add support for [AAA=bbb] tags.
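To make the stack idea concrete for the [quote] tag from the question, here is a minimal sketch. RenderQuotes and SameTag are hypothetical helpers, and treating unknown or unpaired tags as literal text is only one assumed policy among several reasonable ones. It is written with top-level statements, so it assumes C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: renders nested [quote] tags as HTML with a stack.
static string RenderQuotes(string input)
{
    // Tokenize into open tags, close tags, and plain text (or a stray '[').
    var rx = new Regex(@"(?<OPEN>\[[A-Za-z]+\])|(?<CLOSE>\[/[A-Za-z]+\])|(?<TEXT>[^\[]+|\[)");

    var openTags = new Stack<string>();        // raw text of each open tag, e.g. "[quote]"
    var buffers = new Stack<StringBuilder>();  // one output buffer per nesting level
    buffers.Push(new StringBuilder());         // root buffer for the top-level output

    foreach (Match m in rx.Matches(input))
    {
        if (m.Groups["OPEN"].Success)
        {
            openTags.Push(m.Value);
            buffers.Push(new StringBuilder());
        }
        else if (m.Groups["CLOSE"].Success && openTags.Count > 0 && SameTag(openTags.Peek(), m.Value))
        {
            var open = openTags.Pop();
            var inner = buffers.Pop().ToString();
            bool isQuote = SameTag(open, "[/quote]");
            buffers.Peek().Append(isQuote
                ? "<div class='quote'>" + inner + "</div>"
                : open + inner + m.Value);     // unknown tag: keep it as literal text
        }
        else
        {
            // Plain text, or a close tag that does not pair up
            // (assumed policy: keep it literally).
            buffers.Peek().Append(m.Value);
        }
    }

    // Assumed policy: open tags that were never closed are emitted as literal text.
    while (openTags.Count > 0)
    {
        var inner = buffers.Pop().ToString();
        buffers.Peek().Append(openTags.Pop() + inner);
    }

    return buffers.Peek().ToString();
}

// Compares the name in "[name]" with the name in "[/name]", ignoring case.
static bool SameTag(string open, string close)
{
    return string.Equals(open.Substring(1, open.Length - 2),
                         close.Substring(2, close.Length - 3),
                         StringComparison.OrdinalIgnoreCase);
}

Console.WriteLine(RenderQuotes("[quote] This user posted [quote] blah [/quote] [/quote]"));
```

For the sample input from the question this prints <div class='quote'> This user posted <div class='quote'> blah </div> </div>, with the inner quote rendered before it is folded into the outer one.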

(2nd BIG EDIT)

Bwahahahah! I DID know groupings were the way to do it!

// Some classes

class BaseTagMatch {
    public Capture Capture;

    public override string ToString()
    {
        return String.Format("{1}: {2} [{0}]", GetType(), Capture.Index, Capture.Value.ToString());
    }
}

class BeginTag : BaseTagMatch
{
    public int Index;
    public Capture Options;
    public EndTag EndTag;
}

class EndTag : BaseTagMatch {
    public int Index;
    public BeginTag BeginTag;
}

class Text : BaseTagMatch
{
}

class UnmatchedTag : BaseTagMatch
{
}

// The code

var pattern =
    @"(?# line 01) ^" +
    @"(?# line 02) (" +
    // Non [ Text
    @"(?# line 03)   (?>(?<TEXT>[^\[]+))" +
    @"(?# line 04)   |" +
    // Immediately closed tag [a/]
    @"(?# line 05)   (?>\[  (?<TAG>  [A-Z]+  )  \x20*  =?  \x20*  (?<TAG_OPTION>(  (?<=  =  \x20*)  (  (?!  \x20*  /\])  [^\[\]\r\n]  )*  )?  )  (?<BEGIN_INNER_TEXT>)(?<END_INNER_TEXT>)  \x20*  /\]  )" +
    @"(?# line 06)   |" +
    // Matched open tag [a]
    @"(?# line 07)   \[  (?<TAG>  (?<OPEN>  [A-Z]+  )  )  \x20* =?  \x20* (?<TAG_OPTION>(  (?<=  =  \x20*)  (  (?!  \x20*  \])  [^\[\]\r\n]  )*  )?  )  \x20*  \]  (?<BEGIN_INNER_TEXT>)" +
    @"(?# line 08)   |" +
    // Matched close tag [/a]
    @"(?# line 09)   (?>(?<END_INNER_TEXT>)  \[/  \k<OPEN>  \x20*  \]  (?<-OPEN>))" +
    @"(?# line 10)   |" +
    // Unmatched open tag [a]
    @"(?# line 11)   (?>(?<UNMATCHED_TAG>  \[  [A-Z]+  \x20* =?  \x20* (  (?<=  =  \x20*)  (  (?!  \x20*  \])  [^\[\]\r\n]  )*  )?  \x20*  \]  )  )" +
    @"(?# line 12)   |" +
    // Unmatched close tag [/a]
    @"(?# line 13)   (?>(?<UNMATCHED_TAG>  \[/  [A-Z]+  \x20*  \]  )  )" +
    @"(?# line 14)   |" +
    // Single [ of Text (unmatched by other patterns)
    @"(?# line 15)   (?>(?<TEXT>\[))" +
    @"(?# line 16) )*" +
    @"(?# line 17) (?(OPEN)(?!))" +
    @"(?# line 18) $";

var rx = new Regex(pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);

var match = rx.Match("[div=c:max max]asdf[p = 1   ] a [p=2] [b  =  p/pp   /] [q/] \n[a]sd [/z]  [ [/p]f[/p]asdffds[/DIV] [p][/p]");

////var tags = match.Groups["TAG"].Captures.OfType<Capture>().ToArray();
////var tagoptions = match.Groups["TAG_OPTION"].Captures.OfType<Capture>().ToArray();
////var begininnertext = match.Groups["BEGIN_INNER_TEXT"].Captures.OfType<Capture>().ToArray();
////var endinnertext = match.Groups["END_INNER_TEXT"].Captures.OfType<Capture>().ToArray();
////var text = match.Groups["TEXT"].Captures.OfType<Capture>().ToArray();
////var unmatchedtag = match.Groups["UNMATCHED_TAG"].Captures.OfType<Capture>().ToArray();

var tags = match.Groups["TAG"].Captures.OfType<Capture>().Select((p, ix) => new BeginTag { Capture = p, Index = ix, Options = match.Groups["TAG_OPTION"].Captures[ix] }).ToList();

Func<Capture, int, EndTag> func = (p, ix) =>
{
    var temp = new EndTag { Capture = p, Index = ix, BeginTag = tags[ix] };
    tags[ix].EndTag = temp;
    return temp;
};

var endTags = match.Groups["END_INNER_TEXT"].Captures.OfType<Capture>().Select((p, ix) => func(p, ix));
var text = match.Groups["TEXT"].Captures.OfType<Capture>().Select((p, ix) => new Text { Capture = p });
var unmatchedTags = match.Groups["UNMATCHED_TAG"].Captures.OfType<Capture>().Select((p, ix) => new UnmatchedTag { Capture = p });

// Here you have all the tags and the inner text neatly ordered and ready to be recomposed in a StringBuilder.
var allTags = tags.Cast<BaseTagMatch>().Union(endTags).Union(text).Union(unmatchedTags).ToList();
allTags.Sort((p, q) => p.Capture.Index - q.Capture.Index);

foreach (var el in allTags)
{
    var type = el.GetType();

    if (type == typeof(BeginTag))
    {

    }
    else if (type == typeof(EndTag))
    {

    }
    else if (type == typeof(UnmatchedTag))
    {

    }
    else
    {
        // Text
    }
}

Case-insensitive tag matching, ignores tags that are not correctly closed, supports immediately closed tags ([BR/]). And someone said it wasn't possible with Regex... Bwahahahahah!

TAG, TAG_OPTION, BEGIN_INNER_TEXT and END_INNER_TEXT are matched (they always have the same number of elements). TEXT and UNMATCHED_TAG AREN'T matched! TAG and TAG_OPTION are self-explanatory (both are stripped of useless spaces). The BEGIN_INNER_TEXT and END_INNER_TEXT captures are always empty, but you can use their Index property to see where the tags begin/end. UNMATCHED_TAG contains the tags that have been opened but not closed, or closed but not opened. It doesn't contain tags that are wrong in format (for example [ 123 ]).

In the end I take the TAG, END_INNER_TEXT (to see where the tags end), TEXT and UNMATCHED_TAG captures and sort them by index. Then you can take allTags, put it in a foreach, and test the type of each element. Easy :-) :-)

As a small note, the Regex uses RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase. The first two make it easier to write and to read; the third one is semantic: it makes [A] match with [/a].
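The effect of the third option can be checked in isolation. The sketch below uses a deliberately simplified pattern (an assumption for illustration, not the full pattern above): an open tag, some content, and a close tag that refers back to the open tag's name with \k<OPEN>. It assumes C# 9 or later for the top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Simplified illustration pattern: [NAME] ... [/NAME], where the close tag
// must repeat the captured name through the \k<OPEN> backreference.
var simple = @"\[ (?<OPEN> [A-Z]+ ) \] .*? \[/ \k<OPEN> \]";

// These two options only affect how the pattern is written and captured.
var readable = RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture;

// Without IgnoreCase the backreference "A" cannot match "a".
bool caseSensitive = Regex.IsMatch("[A]text[/a]", simple, readable);

// With IgnoreCase the backreference comparison ignores case as well.
bool caseInsensitive = Regex.IsMatch("[A]text[/a]", simple, readable | RegexOptions.IgnoreCase);

Console.WriteLine(caseSensitive);   // False
Console.WriteLine(caseInsensitive); // True
```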

Necessary readings:

http://www.codeproject.com/KB/recipes/Nested_RegEx_explained.aspx
http://www.codeproject.com/KB/recipes/RegEx_Balanced_Grouping.aspx

Answer 2

I'm not sure where regex is going to benefit you. It'd be very basic, but you could just replace [quote] with <div class="quote"> and [/quote] with </div>. The same could be said for all of the other BBCode-style tags you want to allow.

In other words, literally translate them into the HTML you want them to represent.
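A minimal sketch of that literal translation, assuming C# 9 or later for the top-level statements. ReplaceTags and tagTable are hypothetical names, and the [b] entries are an assumed extra mapping; only the [quote] mapping comes from the text above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical tag table; extend it with whichever tags you want to allow.
var tagTable = new Dictionary<string, string>
{
    { "[quote]", "<div class=\"quote\">" },
    { "[/quote]", "</div>" },
    { "[b]", "<strong>" },
    { "[/b]", "</strong>" },
};

// Replaces every known tag with its HTML counterpart.
// string.Replace is case-sensitive and nothing checks that tags are paired.
static string ReplaceTags(string text, Dictionary<string, string> table)
{
    foreach (var pair in table)
    {
        text = text.Replace(pair.Key, pair.Value);
    }
    return text;
}

Console.WriteLine(ReplaceTags("[quote] outer [quote] inner [/quote] [/quote]", tagTable));
```

Nesting works here because each open and close tag is translated independently. The trade-off is that an unbalanced post, such as a [quote] with no [/quote], produces unbalanced HTML.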

solution1
2 ACCEPTED 2011-02-24 21:24:47

solution2
0 2011-02-24 19:00:04
