C# regex to parse an HTML string and add an ID to each header tag?

Question

I have a CMS, and I need to apply some automatic formatting to HTML strings before they are served to the client. In the database I might have an HTML string like this:

> "<h2>Example Header</h2><p>Here is some text about that
> header.</p><h2>Another Header 2</h2><p>Well I got more information
> here.</p>"

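I want to add an id attribute to each H2 tag. Its value should be the text inside the H2 with the spaces removed, so that it can be used for anchor links. The example above would become: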

> "<h2 id="ExampleHeader">Example Header</h2><p>Here is some text about that
> header.</p><h2 id="AnotherHeader2">Another Header 2</h2><p>Well I got more
> information here.</p>"

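So each H2 in the string goes from:

<h2>Header Example Text Right Here</h2>

to:

<h2 id="HeaderExampleTextRightHere">Header Example Text Right Here</h2>

The spaces are removed in the id, but the text is otherwise identical. How can I do this with a regular expression?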

Answer 1

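Is there no HTML processing library available in C#? If not, go ahead: a regular expression handles your sample HTML conveniently. For more complex input, though, it will not be reliable.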

Here is a regex and replacement for the sample input. Keep in mind that it is meant for the sample input only:

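// The id is the header text with spaces removed, so a match evaluator is used instead of a plain "$1" substitution.
htmls = Regex.Replace(htmls, @"<h2>([^<]*)</h2>", m => "<h2 id=\"" + m.Groups[1].Value.Replace(" ", "") + "\">" + m.Groups[1].Value + "</h2>");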
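
For comparison, here is a minimal sketch of the parser-based route this answer alludes to. It assumes the third-party HtmlAgilityPack NuGet package, which nobody in this thread names, and the class and method names (`HeaderIdsWithParser`, `AddIds`) are made up for illustration. It is one possible approach, not a drop-in solution.

```csharp
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // assumption: third-party NuGet package, not mentioned in the thread

public static class HeaderIdsWithParser
{
    // Hypothetical helper: gives every <h2> an id made of its text without whitespace.
    public static string AddIds(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var headers = doc.DocumentNode.SelectNodes("//h2");
        if (headers != null)
        {
            foreach (HtmlNode h2 in headers)
            {
                // InnerText also covers headers that contain nested tags such as <em>.
                string id = Regex.Replace(h2.InnerText, @"\s+", "");
                h2.SetAttributeValue("id", id);
            }
        }
        return doc.DocumentNode.OuterHtml;
    }
}
```

Unlike the regex, this should also cope with headers that already carry attributes or contain nested markup, at the cost of an extra dependency.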

Answer 2

You can use:

Regex.Replace("<h2>XYZ</h2>", "<h2>(?<innerText>[^<]*)</h2>", x => string.Format("<h2 id=\"{0}\">{1}</h2>", x.Groups["innerText"].Value.Replace(" ", ""), x.Groups["innerText"].Value))
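
To try the idea end to end, the following self-contained sketch wraps the replacement in a small program. The class name `HeaderIds` and method name `AddIds` are hypothetical. Covering `h1` through `h6` instead of only `h2` is an assumption based on the question title, as is stripping all whitespace rather than only space characters. Like the one-liners above, it only handles attribute-free headers with no nested tags.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeaderIds
{
    // Matches an attribute-free <h1>..<h6> element whose content has no nested tags.
    // The \1 backreference makes the closing tag match the opening one.
    private static readonly Regex Header =
        new Regex(@"<(h[1-6])>([^<]*)</\1>", RegexOptions.IgnoreCase);

    public static string AddIds(string html)
    {
        return Header.Replace(html, m =>
        {
            string tag = m.Groups[1].Value;
            string text = m.Groups[2].Value;
            // Drop every whitespace character (spaces, tabs, line breaks) to build the id.
            string id = Regex.Replace(text, @"\s+", "");
            return "<" + tag + " id=\"" + id + "\">" + text + "</" + tag + ">";
        });
    }

    public static void Main()
    {
        string html = "<h2>Example Header</h2><p>Some text.</p><h2>Another Header 2</h2>";
        Console.WriteLine(AddIds(html));
        // prints: <h2 id="ExampleHeader">Example Header</h2><p>Some text.</p><h2 id="AnotherHeader2">Another Header 2</h2>
    }
}
```

Two headers with the same text would receive the same id here. If the anchors must be unique, the caller would need to track the ids already issued and add a suffix.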

Answer 1
2 2014-03-27 16:35:16

Answer 2
1 Accepted 2014-03-27 16:37:40
