Which regexp should be used to replace bbcode-style tags with HTML tags

I want to replace some specific character sequences (taken from user input) with specific HTML tags such as <b>, <u>, <i>, etc. I am using some regexps in JavaScript, but I cannot work out which one is best. I am using

/\[u\](.*?)\[u\]/g // replace with <u>$1</u>
/*
 * If I type [u]underline[][u], this pattern accepts the '[]' brackets inside.
 */

Or should I use

/\[u\]\([^\[u\]]+)\[u\]/g // this doesn't allow square brackets inside the underlined text

I am also using the same regexps in PHP. I am confused about which kind of regexp would be safe from XSS attacks.

No regexes should be used. Find a decent BBCode parser (for instance, PHP's BBCode) and use it. Trying to parse HTML or any established markup language with regex yourself is asking for pain, trouble, and insecurity.

bobince wrote an epic answer about parsing HTML with regexes, which is relevant here as well and always worth a read.

You asked whether to use /\[u\](.*?)\[u\]/g or /\[u\]\([^\[u\]]+)\[u\]/g. Neither pattern is written with a closing tag, which is important: [u]underlined text[/u] is BBCode.

A solution using extended regex could be recursive patterns. I don't think JavaScript supports them yet, but they work fine with, e.g., PHP, which uses PCRE.

The problem: tags can be nested, which makes it difficult to match the outermost ones.


Consider what the following patterns do in this PHP example:

$str = 
'The [u][u][u]young[/u] quick[/u] brown[/u] fox jumps over the [u]lazy dog[/u]';

1.) Matching any character in [u]...[/u] using the non-greedy dot

$pattern = '~\[u\](.*?)\[/u\]~';
$str = preg_replace($pattern, '<u>\1</u>', $str);
echo htmlspecialchars($str);

Output:

The <u>[u][u]young</u> quick[/u] brown[/u] fox jumps over the <u>lazy dog</u>

This looks for the first occurrence of [u] and consumes as few characters as possible until the first [/u], which results in mismatched tags. So this is a bad choice.
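The same mismatch can be reproduced in JavaScript, which is what the question uses. This is only a minimal sketch: the function name `replaceLazy` and the variable `sample` are made up for illustration, and the pattern is the PHP one above rewritten as a JavaScript literal with the global flag.

```javascript
// Hypothetical helper: JavaScript form of pattern 1 (non-greedy dot).
function replaceLazy(input) {
  // .*? stops at the FIRST [/u], regardless of how many [u] were opened.
  return input.replace(/\[u\](.*?)\[\/u\]/g, "<u>$1</u>");
}

const sample =
  "The [u][u][u]young[/u] quick[/u] brown[/u] fox jumps over the [u]lazy dog[/u]";

console.log(replaceLazy(sample));
// The <u>[u][u]young</u> quick[/u] brown[/u] fox jumps over the <u>lazy dog</u>
```

The outermost [u] is paired with the innermost [/u], so two opening and two closing tags are left over as plain text.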


2.) Using a negated character class for square brackets, [^[\]], for what is inside [u]...[/u]

$pattern = '~\[u\]([^[\]]*)\[/u\]~';
$str = preg_replace($pattern, '<u>\1</u>', $str);
echo htmlspecialchars($str);

Output:

The [u][u]<u>young</u> quick[/u] brown[/u] fox jumps over the <u>lazy dog</u>

This looks for a [u] followed by any number of characters that are neither [ nor ], up to the closing [/u]. It is "safer" because it only matches the innermost elements, but it would still require additional effort to resolve the nesting from the inside out.
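One way that additional effort could look, sketched in JavaScript under the assumption that nested <u> elements in the result are acceptable: apply the innermost-only pattern repeatedly until the string stops changing. The function name `replaceInsideOut` is hypothetical.

```javascript
// Hypothetical helper: resolve nested [u]...[/u] pairs from the inside out.
function replaceInsideOut(input) {
  // Matches only pairs whose content contains no square brackets (innermost pairs).
  const innermost = /\[u\]([^[\]]*)\[\/u\]/g;
  let previous;
  let current = input;
  do {
    previous = current;
    // Each pass turns the current innermost pairs into <u>...</u>,
    // which exposes the next level for the following pass.
    current = current.replace(innermost, "<u>$1</u>");
  } while (current !== previous);
  return current;
}

console.log(
  replaceInsideOut(
    "The [u][u][u]young[/u] quick[/u] brown[/u] fox jumps over the [u]lazy dog[/u]"
  )
);
// The <u><u><u>young</u> quick</u> brown</u> fox jumps over the <u>lazy dog</u>
```

The loop terminates because every successful replacement removes one [u] from the string. Unbalanced tags are simply left in place as text.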


3.) Using recursion plus the negated character class [^[\]] for what is inside [u]...[/u]

$pattern = '~\[u\]((?:[^[\]]+|(?R))*)\[/u\]~';
$str = preg_replace($pattern, '<u>\1</u>', $str);
echo htmlspecialchars($str);

Output:

The <u>[u][u]young[/u] quick[/u] brown</u> fox jumps over the <u>lazy dog</u>

Similar to the second pattern: look for the first occurrence of [u], but then EITHER match one or more characters that are neither [ nor ], OR paste the whole pattern in at (?R). Repeat that zero or more times until the closing [/u] is matched.

To get rid of the remaining unresolved BB tags inside, we can now simply remove them:

$str = preg_replace('~\[/?u\]~',"",$str);

And we get the desired result:

Output: The <u>young quick brown</u> fox jumps over the <u>lazy dog</u>
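Since (?R) is not available in JavaScript regular expressions, the same end result can be approximated there with a different technique: a depth counter inside a replace() callback rather than a recursive pattern. The sketch below is an assumption about how one might do it, not a translation of the PHP code; the function name `underlineOutermost` is made up, and the handling of unbalanced tags (stray closers kept as text, unclosed openers closed at the end) is a choice of this sketch.

```javascript
// Hypothetical helper: emit <u>...</u> only for the outermost [u]...[/u] pair
// and drop the nested tags, using a depth counter instead of recursion.
function underlineOutermost(input) {
  let depth = 0;
  const converted = input.replace(/\[(\/?)u\]/g, (tag, slash) => {
    if (slash === "") {
      depth += 1;
      return depth === 1 ? "<u>" : ""; // only the outermost opener survives
    }
    if (depth === 0) {
      return tag; // stray closing tag: keep it as plain text
    }
    depth -= 1;
    return depth === 0 ? "</u>" : ""; // only the outermost closer survives
  });
  // Assumption of this sketch: close an element the input never closed.
  return depth > 0 ? converted + "</u>" : converted;
}

console.log(
  underlineOutermost(
    "The [u][u][u]young[/u] quick[/u] brown[/u] fox jumps over the [u]lazy dog[/u]"
  )
);
// The <u>young quick brown</u> fox jumps over the <u>lazy dog</u>
```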

There are certainly other ways to achieve this, such as preg_replace_callback or, for JavaScript, the replace() method, which can take a callback as the replacement.
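As a rough illustration of the replace() callback route, the sketch below also touches on the XSS concern from the question by escaping the raw input before any HTML is generated. Everything named here (`escapeHtml`, `HTML_TAGS`, `bbcodeToHtml`) is hypothetical, the whitelist is just the three tags the question mentions, and this is a sketch under those assumptions rather than a vetted sanitizer; a dedicated parser remains the safer option.

```javascript
// Hypothetical helper: escape characters that are significant in HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Whitelist of BBCode tag names and the HTML element each one maps to.
const HTML_TAGS = { b: "b", i: "i", u: "u" };

// Hypothetical helper: escape first, then convert whitelisted pairs inside out.
function bbcodeToHtml(input) {
  // \1 requires the closing tag to have the same name as the opening tag.
  const pair = /\[(b|i|u)\]([^[\]]*)\[\/\1\]/g;
  let previous;
  let current = escapeHtml(input);
  do {
    previous = current;
    current = current.replace(pair, (match, name, inner) => {
      const tag = HTML_TAGS[name];
      return `<${tag}>${inner}</${tag}>`;
    });
  } while (current !== previous);
  return current;
}

console.log(bbcodeToHtml("[b]bold [i]both[/i][/b] <script>"));
// <b>bold <i>both</i></b> &lt;script&gt;
```

Because the input is escaped before conversion, the only angle brackets in the result are the ones the callback emits for whitelisted tags, and those never carry attributes.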
