Regular expression crashes Apache due to PCRE limitations

I am currently writing a BBCode parsing engine and have run into a problem I can't figure out on my own.

I have hit a problem exactly like this one: Apache / PHP on Windows crashes with regular expression

That is, if I run something like the example below, Apache crashes because the recursion count reaches 690 (1MB memory limit for PCRE):

$txt = '[b]'.str_repeat('a', 338).'[/b]';  // if I change repeat count to lower value it's ok
$regex = '#\[(?P<attributes>(?P<tag>[a-z0-9_]*?)(?:=.*?|\s.*?|))](?P<content>(?:[^[]|\[(?!/?(?P=tag)])|(?R))+?)\[/(?P=tag)]#mi';

echo preg_replace_callback($regex, function($matches) { return $matches['content']; }, $txt);
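One way to turn the crash into a catchable failure is to cap PCRE's recursion depth and check preg_last_error() afterwards. The following is only a sketch: the limit of 500 is an illustrative assumption, replaceGuarded is a hypothetical helper name, and how the limit is applied depends on the PCRE version and on whether JIT is enabled.

```php
<?php
// Assumption: 500 is an illustrative value, not a recommendation; tune it to
// the thread stack size of your server.
ini_set('pcre.recursion_limit', '500');

// Hypothetical helper: run the replacement and fall back to the unparsed
// input if PCRE gives up instead of letting the process die.
function replaceGuarded(string $regex, callable $callback, string $txt): string
{
    $result = preg_replace_callback($regex, $callback, $txt);
    if ($result === null) {
        // preg_replace_callback() returns null on error; preg_last_error()
        // says which limit was hit.
        $limits = [
            PREG_RECURSION_LIMIT_ERROR,
            PREG_BACKTRACK_LIMIT_ERROR,
            PREG_JIT_STACKLIMIT_ERROR,
        ];
        if (in_array(preg_last_error(), $limits, true)) {
            error_log('PCRE limit reached; leaving input unparsed');
        }
        return $txt;
    }
    return $result;
}

$txt = '[b]' . str_repeat('a', 338) . '[/b]';
$regex = '#\[(?P<attributes>(?P<tag>[a-z0-9_]*?)(?:=.*?|\s.*?|))](?P<content>(?:[^[]|\[(?!/?(?P=tag)])|(?R))+?)\[/(?P=tag)]#mi';

$out = replaceGuarded($regex, function ($matches) {
    return $matches['content'];
}, $txt);
```

This does not make the pattern any cheaper; it only keeps a too-deep match from taking the server down, at the price of leaving that input unparsed.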

So I need to somehow reduce the use of * and + in my regex, but that's where I run out of ideas, so I thought maybe you could suggest something.
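A possible way to reduce the number of iterations without changing what the pattern matches is to consume each run of non-bracket characters possessively. In the sketch below the only change to the pattern is [^[] becoming [^[]++; unwrapTags is a hypothetical helper name, and whether this is enough to stay under the limit on a given PCRE build is an assumption that would need testing.

```php
<?php
// Same pattern as in the question, except [^[] became [^[]++ so a whole run
// of non-bracket characters is consumed in one possessive step instead of one
// group iteration per character.
$regex = '#\[(?P<attributes>(?P<tag>[a-z0-9_]*?)(?:=.*?|\s.*?|))](?P<content>(?:[^[]++|\[(?!/?(?P=tag)])|(?R))+?)\[/(?P=tag)]#mi';

// Hypothetical helper mirroring the callback from the question:
// replace each matched tag pair by its content.
function unwrapTags(string $regex, string $txt): ?string
{
    return preg_replace_callback($regex, function ($matches) {
        return $matches['content'];
    }, $txt);
}

echo unwrapTags($regex, '[b]' . str_repeat('a', 338) . '[/b]'), "\n";
```

The possessive quantifier is safe here because the content can only end right before a [, so the engine never needs to give back part of a run of ordinary characters.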

Other approaches to parsing BBCode (ones that can handle nested tags) are welcome. However, I would rather not use an already built class or library. I like to do things on my own!

I have also looked into PECL and PEAR HTML_BBCodeParser, but I don't want my application to depend on extensions. More likely I will write a script that checks for the extension and, if it doesn't exist, falls back to the BBCode parser I'm trying to write here.

Sorry if my descriptions are unclear; I'm not a pro at English ^^

EDIT. So, the regex explained:

\[(?P<attributes>(?P<tag>[a-z0-9_]*?)(?:=.*?|\s.*?|))]

This is my opening tag. I have used named groups: 'tag' identifies the tag name and 'attributes' identifies the tag's attributes (think of the tag name as an attribute too). So what is happening here? I try to match a tag name; once it is matched, I try to match anything after an = sign or anything after \s (whitespace) until the closing ] of the tag is reached.

(?P<content>(?:[^[]|\[(?!/?(?P=tag)])|(?R))+?)

Now here I am trying to match the content. This is the tricky part. I look for any character that is not [; if I find a [, I check that it does not start my ending tag (a nested tag of the same name is handled by recursion), and I tell the regex engine to keep doing so until...

\[/(?P=tag)]

... the ending tag is found.

Your regex, especially the zero-width assertions (lookarounds), causes the regex engine to backtrack catastrophically. Moral of the story: regex shouldn't be used to parse languages that are not regular. If you have nested structures, that's not a regular language.
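For nested tags, one alternative that avoids recursive patterns altogether is to split the input into tag and text tokens with a simple, non-nested regex and track open tags on an explicit stack. The following is a minimal sketch, not a complete BBCode implementation: parseBbcode and the node layout (tag, attr, children) are hypothetical, and closing unclosed tags implicitly at the end of the input is an assumption made for brevity.

```php
<?php
// Hypothetical helper: parse BBCode into a nested tree using a stack.
function parseBbcode(string $txt): array
{
    // Tokenize: the tag pattern is flat (no recursion, no lookaround).
    $tokens = preg_split(
        '#(\[/?[a-z0-9_]+(?:[=\s][^\]]*)?\])#i',
        $txt,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    // $stack[0] is a synthetic root node; the last entry is the innermost open tag.
    $stack = [['tag' => null, 'attr' => '', 'children' => []]];

    foreach ($tokens as $tok) {
        if (preg_match('#^\[(/?)([a-z0-9_]+)(?:[=\s]([^\]]*))?\]$#i', $tok, $m)) {
            $name = strtolower($m[2]);
            if ($m[1] === '') {
                // Opening tag: push a new node.
                $stack[] = ['tag' => $name, 'attr' => $m[3] ?? '', 'children' => []];
                continue;
            }
            // Closing tag: accept it only if it matches the innermost open tag.
            $top = count($stack) - 1;
            if ($top > 0 && $stack[$top]['tag'] === $name) {
                $node = array_pop($stack);
                $stack[$top - 1]['children'][] = $node;
                continue;
            }
            // A stray closing tag falls through and is kept as plain text.
        }
        $stack[count($stack) - 1]['children'][] = $tok;
    }

    // Assumption: tags still open at the end of input are closed implicitly.
    while (count($stack) > 1) {
        $node = array_pop($stack);
        $stack[count($stack) - 1]['children'][] = $node;
    }

    return $stack[0]['children'];
}

$tree = parseBbcode('[b]x[i]y[/i][/b]');
```

Nesting depth here costs one array entry per open tag rather than regex recursion, so long runs of text between tags do not add to the depth at all.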

In fact, I think BBCode is evil. BBCode is a markup language invented by lazy programmers who didn't want to filter HTML the proper way. As a result, we now have a loose "standard" that's hard to implement. Filter your HTML the right way:

http://htmlpurifier.org/

I was going to suggest a BBCodeParser...

I have also looked into PECL and Pear HTML_BBCodeParser. But I don't want my application to be dependent on extensions

I find that to be very strange. Why reinvent the wheel? One of the principles of good software engineering is DRY (Don't Repeat Yourself). You're trying to solve a problem that has already been solved.

I like to do things on my own!

That's not bad in and of itself, but there are times when you are better off using a tried and true solution, one that is better tested and more robust than your own (as you're finding out). That way you will spend time on the problem you actually want to solve instead of solving a problem that has already been solved. Don't fall into the trap of reinventing the wheel. :)

My suggestion (and solution) to you is to use a BBCode parser.

EDIT

Another thing is that you're parsing something that is HTML-like. Things of that nature don't lend themselves easily to being parsed by regular expressions.
