Regular expression (regex) for parsing an HTML segment

Question

I am currently trying to come up with a regular expression that will parse out something like the following:

ORIGINAL HTML:

<td align="center"><p>line 1</p><p>line 2</p><p>line 3</p></td>

INTENDED HTML:

<td align="center">line 1<br />line 2<br />line 3</td>

Note that there are other <p>...</p> tags throughout the HTML document that must not be touched. I only want to replace <p>...</p> within a <td> or <th>.

I would also need a regexp to reverse the process. Please note that these regular expressions have to work in VB/VBScript/Classic ASP, so although I can use lookaheads (which I think is the key here), I cannot use lookbehinds. Some regexes I've tried unsuccessfully are:

1. <td[^>]*>(<p>.+<\/p>)<\/td>
2. <td[^>]*>(<p>.+<\/p>)+?<\/td>
3. <td[^>]*><p>(?:(.+?)<\/p><p>(.+))+<\/p><\/td>
4. <td[^>]*>(<p>(?:(?!<\/p>)).*<\/p>)+?<\/td>
5. <td[^>]*>(?:<p>(.+?)<\/p>)*(?:<p>(.+)<\/p>)<\/td>
6. <td[^>]*>(?:<p>(.+?)<\/p>)(?:<p>(.+)<\/p>)*(?:<p>(.+)<\/p>)<\/td>
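
A quick way to see one failure mode of these attempts is to run attempt 1 on a row with several cells. The sketch below uses Python's re module purely for illustration (the target here is VBScript, whose RegExp should treat greedy and lazy quantifiers the same way); the sample row and the variable names are made up for the demonstration.

```python
import re

# Hypothetical sample: three cells on one line, the middle one has no paragraphs.
row = "<tr><td><p>a</p><p>b</p></td><td>x</td><td><p>c</p></td></tr>"

# Attempt 1: the greedy ".+" runs on to the LAST "</p></td>" in the line,
# so the capture swallows the neighbouring cells as well.
greedy = re.search(r"<td[^>]*>(<p>.+</p>)</td>", row)

# Same pattern with a lazy ".+?": the capture stops at the FIRST "</p></td>".
lazy = re.search(r"<td[^>]*>(<p>.+?</p>)</td>", row)

print(greedy.group(1))
print(lazy.group(1))
```

Even the lazy variant only isolates the cell; it still returns all the paragraphs as one capture, so splitting them needs a second step.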

I can "cheat" and pull out the entire line and then parse it manually using standard VB string manipulation functions, but that's definitely not the most elegant way, nor the fastest. There has to be some way to do this in one shot using RegEx's.

Eventually I'd like to take...

<td align="center"><p><span style="color:#ff0000;"><strong>line 1</strong></span></p><p>line 2</p><p>line 3</p></td>

...and turn it into

<td align="center"><span style="color:#ff0000;"><strong>line 1</strong></span><br />line 2<br />line 3</td>

Any ideas (besides not to do this with a regex, lol)?

Thank you!

Answer 1

Regular expressions are not suited to a non-regular language like HTML. You would be better off using a proper HTML parser.

You could use PHP's DOM library:

$doc = new DOMDocument();
$doc->loadHTML($code);
$xpath = new DOMXPath($doc);
$prevCell = null;
foreach ($xpath->query('//td/p | //th/p') as $elem) {  // find all P elements that are a child of a TD or TH
    $cell = $elem->parentNode;
    if ($prevCell !== null && $cell->isSameNode($prevCell)) {  // add BR for any P except the first in its cell
        $cell->insertBefore($doc->createElement('br'), $elem);
    }
    while ($elem->firstChild !== null) {                // move contents out of P
        $cell->insertBefore($elem->firstChild, $elem);
    }
    $cell->removeChild($elem);                          // remove empty P
    $prevCell = $cell;
}
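
For readers not working in PHP, the same parser-based idea can be sketched with Python's standard-library html.parser. This is only an illustration under assumptions, not part of the original answer: the class and function names are hypothetical, the input is assumed to have explicitly closed and properly nested tags, and the markup is re-emitted as it streams through rather than built into a DOM tree.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not tracked as "open".
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}
CELLS = {"td", "th"}


class CellParagraphFlattener(HTMLParser):
    """Re-emit HTML, replacing <p> children of <td>/<th> with <br /> separators."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []    # output fragments, joined at the end
        self.stack = []  # open elements as [tag, flag] pairs
        # flag on a cell: a <p> was already seen; flag on a <p>: it is being dropped

    def handle_starttag(self, tag, attrs):
        parent = self.stack[-1] if self.stack else None
        if tag == "p" and parent is not None and parent[0] in CELLS:
            if parent[1]:                     # not the first paragraph in this cell
                self.out.append("<br />")
            parent[1] = True
            self.stack.append([tag, True])    # remember to drop the matching </p>
            return
        self.out.append(self.get_starttag_text())  # re-emit the tag verbatim
        if tag not in VOID:
            self.stack.append([tag, False])

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. an existing <br />

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1][0] == tag:
            dropped = self.stack.pop()[1]
            if tag == "p" and dropped:
                return                        # closing tag of a removed <p>
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def flatten_cell_paragraphs(html):
    parser = CellParagraphFlattener()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling flatten_cell_paragraphs on the cell from the question should yield the intended markup, while a <p> outside a table cell is passed through untouched because its parent on the stack is not a td or th.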

Answer 2

Here's your problem:

There has to be some way to do this in one shot using RegEx's.

This is false; there is no way. It's mathematically impossible. Regular expressions, even ones with lookahead, cannot maintain the state required to parse HTML.

You have to use an HTML parser. Many have been written; if you specify your target environment, we can help you select one. For example, in .NET the HTML Agility Pack is good.

Answer 3

ASP, and IIS more specifically, do support ISAPI filters; however, I did not want to resort to them, nor did I have to. The HTML segment is only a string and not part of a DOM tree (although I could have converted it to one if need be).

Ultimately, here's how I resolved the issue, since a straight regex apparently cannot do what I want:

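' "it" holds the HTML string. The RegExp setup is not shown in the original post and is assumed here.
Set RE3 = New RegExp
RE3.Global = True
' Find each cell whose content starts and ends with a paragraph (lazy match).
RE3.Pattern = "<td[^>]*><p>.+?<\/p><\/td>"
Set Matches = RE3.Execute(it)
If Matches.Count > 0 Then
   RE3.Pattern = "<p[^>]*>"
   For Each Match In Matches
      ' Drop the opening <p> tags, turn each </p> into <br />, then remove the trailing break.
      itxt_tmp = Replace(Replace(RE3.Replace(Match.Value,""),"</p>","<br />"),"<br /></td>","</td>")
      it = Replace(it,Match.Value,itxt_tmp)
   Next
End If
Set Matches = Nothing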
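
The two-step idea (isolate each cell with a lazy regex, then rewrite only inside the match) can be sketched in Python as well. This is an illustrative variant, not the code I actually ran: the function name is hypothetical, th cells are handled in addition to td, and the opening-tag pattern is tightened so that tags such as <pre> are not mistaken for <p>.

```python
import re

# A cell whose content starts with <p> and ends with </p>; \1 ties the closing
# tag to the opening one so td and th are both covered.
CELL = re.compile(r"<(td|th)\b[^>]*><p>.+?</p></\1>")
# An opening <p>, with or without attributes (but not <pre>, <param>, ...).
P_OPEN = re.compile(r"<p(?:\s[^>]*)?>")


def paragraphs_to_breaks(html):
    def rewrite(match):
        cell = P_OPEN.sub("", match.group(0))     # drop every opening <p>
        cell = cell.replace("</p>", "<br />")     # each </p> becomes a line break
        closing = "</%s>" % match.group(1)
        return cell.replace("<br />" + closing, closing)  # no break after the last line
    return CELL.sub(rewrite, html)
```

Paragraphs outside table cells are never part of a match, so they stay as they are. Like the VBScript version, this assumes each cell sits on one line and that a cell starting with <p> also ends with </p>; otherwise the lazy match can run into the next cell.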

And to go back to the original:

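' "itxt" holds the converted HTML. The RegExp setup is not shown in the original post and is assumed here.
Set RE = New RegExp
RE.Global = True
' Find every cell; the lazy match stops at the first </td>.
RE.Pattern = "<td[^>]*>.+?<\/td>"
Set Matches = RE.Execute(itxt)
If Matches.Count > 0 Then
   For Each Match In Matches
      ' Only cells that contain a line break are converted back.
      If InStr(1,Match.Value,"<br />") > 1 Then
         RE.Pattern = "<td([^>]*)>"
         ' Turn each <br /> into </p><p>, close the last paragraph, and open the first one after <td ...>.
         itxt_tmp = RE.Replace(Replace(Replace(Match.Value,"<br />","</p><p>"),"</td>","</p></td>"),"<td$1><p>")
         itxt = Replace(itxt,Match.Value,itxt_tmp)
      End If
   Next
End If
Set Matches = Nothing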
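
The reverse direction can be sketched the same way in Python. Again this is only an illustration with a hypothetical function name, extended to th cells; it keeps the behaviour of the code above in that cells without a <br /> are left alone.

```python
import re

# Any td/th cell: group 1 is the tag name, group 2 its attributes, group 3 the content.
CELL = re.compile(r"<(td|th)\b([^>]*)>(.+?)</\1>")


def breaks_to_paragraphs(html):
    def rewrite(match):
        tag, attrs, body = match.groups()
        if "<br />" not in body:
            return match.group(0)                 # nothing to convert in this cell
        body = body.replace("<br />", "</p><p>")  # each break separates two paragraphs
        return "<%s%s><p>%s</p></%s>" % (tag, attrs, body, tag)
    return CELL.sub(rewrite, html)
```

Because single-line cells are skipped, a cell that originally held exactly one <p> is not restored; the round trip is only exact for cells with two or more paragraphs.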

Probably not the fastest way, nor the best way, but it does the job. I do not know whether this will help someone else with a similar problem, but I figured I'd toss this code segment out there just in case.

3 answers

Answer 1
0 2011-01-18 20:09:48

Answer 2
0 2011-01-18 23:16:51

Answer 3
0 2011-01-24 18:36:40
