
.NET Regular Expressions in Infinite Cycle

I'm using .NET Regular Expressions to strip HTML code.

Using something like:

<title>(?<Title>[\w\W]+?)</title>[\w\W]+?<div class="article">(?<Text>[\w\W]+?)</div>

This works 99% of the time, but sometimes, when parsing...

Regex.IsMatch(HTML, Pattern)

The parser just blocks, and it stays on this line of code for several minutes or indefinitely.

What's going on?

Your regex will work just fine when your HTML string actually contains HTML that fits the pattern. But when your HTML does not fit the pattern, e.g. if the last tag is missing, your regex will exhibit what I call "catastrophic backtracking". Click that link and scroll down to the "Quickly Matching a Complete HTML File" section. It describes your problem exactly.

[\w\W]+? is a complicated way of saying .+? with RegexOptions.Singleline.
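As a minimal sketch of the two points above, the pattern from the question can be written with .+? under RegexOptions.Singleline, and a match timeout can be passed to the Regex constructor (available since .NET Framework 4.5) so that a runaway match throws RegexMatchTimeoutException instead of blocking. The class name ArticleRegex and the helper TryMatch are hypothetical names chosen for illustration. The timeout only bounds the damage; it does not remove the backtracking itself.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ArticleRegex
{
    // Same pattern as in the question, with [\w\W] replaced by '.' (Singleline makes '.' match newlines).
    private const string Pattern =
        @"<title>(?<Title>.+?)</title>.+?<div class=""article"">(?<Text>.+?)</div>";

    // Hypothetical helper: returns false when there is no match or when the match times out.
    public static bool TryMatch(string html, TimeSpan timeout, out string title, out string text)
    {
        title = string.Empty;
        text = string.Empty;

        // The third constructor argument is the match timeout.
        var regex = new Regex(Pattern, RegexOptions.Singleline, timeout);
        try
        {
            Match m = regex.Match(html);
            if (!m.Success)
            {
                return false;
            }
            title = m.Groups["Title"].Value;
            text = m.Groups["Text"].Value;
            return true;
        }
        catch (RegexMatchTimeoutException)
        {
            // The engine gave up after the timeout instead of blocking the caller.
            return false;
        }
    }
}
```

A call such as ArticleRegex.TryMatch(html, TimeSpan.FromSeconds(2), out var title, out var text) then either fills both values or returns false within roughly the given time.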

With some effort, you can make regex work on HTML. However, have you looked at the HTML Agility Pack? It makes it much easier to work with HTML as a DOM, with support for XPath-type queries etc. (e.g. "//div[@class='article']").
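A rough sketch of that approach, assuming the third-party HtmlAgilityPack NuGet package is referenced by the project. The class name ArticleExtractor and its two methods are hypothetical; the XPath query is the one quoted above. Note that @class='article' only matches when the class attribute is exactly "article".

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

public static class ArticleExtractor
{
    // Hypothetical helper: text of the first <title> element, or null if there is none.
    public static string GetTitle(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//title");
        return node == null ? null : node.InnerText;
    }

    // Hypothetical helper: text of the first <div class="article">, or null if there is none.
    public static string GetArticleText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // SelectSingleNode returns null when nothing matches the XPath query.
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[@class='article']");
        return node == null ? null : node.InnerText;
    }
}
```

Because the document is parsed into a tree first, a missing closing tag results in a null check or a differently shaped tree rather than a long-running match.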

You're asking your regex to do a lot there. After every character, it has to look ahead to see if the next bit of text can be matched with the next part of the pattern.

Regex is a pattern matching tool. Whilst you can use it for simple parsing, you'd be better off using a specific parser (such as the HTML Agility Pack, as mentioned by Marc).


 