Which regexp should be used to replace bbcode-style tags with HTML tags

Question

I want to replace specific markers taken from user input with specific HTML tags such as <u>. I am using some regexps in JavaScript, but I cannot work out which one is best. I am using

/\[u\](.*?)\[u\]/g // replace with <u>$1</u>
/*
 * if i type [u]underline[][u] //this allows '[]' braces
*/

or should I use

/\[u\]\([^\[u\]]+)\[u\]/g // this doesn't allow third braces to be underlined

I am also using the same regexps in PHP. I am unsure which kind of regexp would be safe from XSS attacks.

Answer 1

No regexes should be used. Find a decent BBCode parser (for instance, PHP's BBCode) and use it. Trying to parse HTML or any established markup language with regexes yourself is asking for pain, trouble, and insecurity.

bobince wrote an epic answer about parsing HTML with regexes, which is relevant here as well and always worth a read.

Answer 2

You asked whether to use /\[u\](.*?)\[u\]/g or /\[u\]\([^\[u\]]+)\[u\]/g. Neither pattern includes a closing tag, which is important: BBCode is written as [u]underlined text[/u].

A solution using extended regex syntax could be recursive patterns. I think JavaScript does not support them yet, but they work fine in, for example, PHP, which uses PCRE.

The problem: tags can be nested, which makes it difficult to match the outermost ones.

Consider what the following patterns do in this PHP example:

$str = 
'The [u][u][u]young[/u] quick[/u] brown[/u] fox jumps over the [u]lazy dog[/u]';

1.) Matching any character in [u]...[/u] using the non-greedy dot

$pattern = '~\[u\](.*?)\[/u\]~';
$str = preg_replace($pattern, '<u>\1</u>', $str);
echo htmlspecialchars($str);

outputs:

The <u>[u][u]young</u> quick[/u] brown[/u] fox jumps over the <u>lazy dog</u>

It looks for the first occurrence of [u] and consumes as few characters as possible until it reaches the nearest [/u]. That pairs the outermost opening tag with the innermost closing tag, which results in tag mismatches. So this is a bad choice.
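Since the question is about JavaScript as well, here is a minimal sketch of the same lazy-dot pattern as a JavaScript regex literal. The function name `replaceLazy` and the variable `nestedSample` are made up for illustration; the behaviour is the same mismatch as in the PHP run.

```javascript
// Lazy-dot pattern from option 1, written as a JavaScript regex literal.
// Note: the closing tag's slash must be escaped inside a /.../ literal.
function replaceLazy(input) {
  return input.replace(/\[u\](.*?)\[\/u\]/g, '<u>$1</u>');
}

// Same sample string as in the PHP example.
const nestedSample =
  'The [u][u][u]young[/u] quick[/u] brown[/u] fox jumps over the [u]lazy dog[/u]';

console.log(replaceLazy(nestedSample));
// The <u>[u][u]young</u> quick[/u] brown[/u] fox jumps over the <u>lazy dog</u>
```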

2.) Using a negated character class for square brackets, [^[\]], for what is inside [u]...[/u]

$pattern = '~\[u\]([^[\]]*)\[/u\]~';
$str = preg_replace($pattern, '<u>\1</u>', $str);
echo htmlspecialchars($str);

outputs:

The [u][u]<u>young</u> quick[/u] brown[/u] fox jumps over the <u>lazy dog</u>

It looks for an occurrence of [u] followed by any number of characters that are not [ or ], up to the closing [/u]. It is "safer" because it only matches the innermost elements, but it would still require additional effort to resolve the nesting from the inside out.
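One way that "additional effort" could look in JavaScript is to re-run the innermost-only pattern until a pass stops changing the string. This is only a sketch under that assumption; `resolveInsideOut` is a hypothetical helper name, and the input is assumed to be HTML-escaped already.

```javascript
// Hypothetical helper: convert innermost [u]...[/u] pairs repeatedly,
// working from the inside out, until nothing is left to replace.
function resolveInsideOut(input) {
  // Content may not contain [ or ], so only innermost pairs match.
  const innermost = /\[u\]([^[\]]*)\[\/u\]/g;
  let previous;
  let current = input;
  do {
    previous = current;
    current = current.replace(innermost, '<u>$1</u>');
  } while (current !== previous);
  return current;
}

console.log(resolveInsideOut(
  'The [u][u][u]young[/u] quick[/u] brown[/u] fox jumps over the [u]lazy dog[/u]'
));
// The <u><u><u>young</u> quick</u> brown</u> fox jumps over the <u>lazy dog</u>
```

Each pass turns one nesting level into HTML, so the loop runs once per level plus one final pass that detects no change. Unbalanced leftovers such as a stray [/u] stay in the text as they are.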

3.) Using recursion plus a negated character class for square brackets, [^[\]], for what is inside [u]...[/u]

$pattern = '~\[u\]((?:[^[\]]+|(?R))*)\[/u\]~';
$str = preg_replace($pattern, '<u>\1</u>', $str);
echo htmlspecialchars($str);

outputs:

The <u>[u][u]young[/u] quick[/u] brown</u> fox jumps over the <u>lazy dog</u>

Similar to the second pattern: it looks for the first occurrence of [u], but then EITHER matches one or more characters that are not [ or ] OR re-enters the whole pattern at (?R). This alternation is repeated zero or more times until the closing [/u] is matched.

To get rid of the remaining BBCode tags inside that were not resolved, we can now simply remove them:

$str = preg_replace('~\[/?u\]~',"",$str);

And we get the desired result:

outputs: The <u>young quick brown</u> fox jumps over the <u>lazy dog</u>
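JavaScript regexes have no (?R), so the recursive pattern cannot be ported directly. As a rough substitute (a different technique, not the PCRE recursion itself), a depth counter over the [u] and [/u] tokens can wrap each outermost run and drop the nested tags in one pass. `convertOutermost` is a hypothetical name, and the handling of unbalanced input (stray closing tags are dropped, an unclosed run is closed at the end) is my own assumption rather than what the PHP code does.

```javascript
// Hypothetical helper: emit <u>...</u> around each outermost [u]...[/u] run
// and discard nested [u] / [/u] tags, tracking nesting with a counter.
function convertOutermost(input) {
  const token = /\[(\/?)u\]/g; // matches [u] or [/u]
  let output = '';
  let depth = 0;   // current nesting level
  let cursor = 0;  // end of the last token consumed
  let match;
  while ((match = token.exec(input)) !== null) {
    output += input.slice(cursor, match.index); // copy text before the token
    cursor = token.lastIndex;
    if (match[1] === '') {
      if (depth === 0) output += '<u>'; // outermost opening tag
      depth += 1;
    } else if (depth > 0) {
      depth -= 1;
      if (depth === 0) output += '</u>'; // outermost closing tag
    }
    // Assumption: a closing tag at depth 0 is stray and is simply dropped.
  }
  output += input.slice(cursor);
  if (depth > 0) output += '</u>'; // assumption: close an unclosed run
  return output;
}

console.log(convertOutermost(
  'The [u][u][u]young[/u] quick[/u] brown[/u] fox jumps over the [u]lazy dog[/u]'
));
// The <u>young quick brown</u> fox jumps over the <u>lazy dog</u>
```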

There are certainly other ways to achieve this, such as preg_replace_callback or, for JavaScript, the replace() method, which accepts a callback as the replacement.
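To illustrate the replace() callback, and to touch on the XSS concern from the question, here is a small sketch that HTML-escapes the user input first and only then turns innermost [u]...[/u] pairs into tags. The names `escapeHtml` and `renderUnderline` are invented for this example, and dropping empty pairs is just one possible use of the callback; treat it as a starting point, not a complete sanitizer.

```javascript
// Escape the characters that are significant in HTML text and attributes.
function escapeHtml(text) {
  const entities = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };
  return text.replace(/[&<>"']/g, (ch) => entities[ch]);
}

// Hypothetical helper: escape first, then convert innermost [u]...[/u] pairs.
function renderUnderline(input) {
  const escaped = escapeHtml(input);
  // The callback receives the whole match and then each capture group.
  return escaped.replace(/\[u\]([^[\]]*)\[\/u\]/g, (whole, inner) => {
    // Per-match logic lives here; this sketch drops empty pairs.
    return inner === '' ? '' : `<u>${inner}</u>`;
  });
}

console.log(renderUnderline('[u]<script>alert(1)</script>[/u] & more'));
// <u>&lt;script&gt;alert(1)&lt;/script&gt;</u> &amp; more
```

Because escaping happens before any tag is generated, the only markup in the result is the <u> element the code itself writes. Tags that take attributes (URLs, colours and so on) need extra validation that this sketch does not cover.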

2 answers

solution1
1 2014-01-03 20:19:16

solution2
0 2014-01-03 23:53:01
