C# Regex to parse HTML string and add ids into each header tag?

I've got a CMS and I need to do some auto-formatting on the HTML strings before they get served to the client. So in the database I may have an HTML string like this:

> "<h2>Example Header</h2><p>Here is some text about that
> header.</p><h2>Another Header 2</h2></p>Well I got more information
> here.</p>"

I want to add an id attribute to every H2 tag, whose value is the text inside that H2 tag with the spaces removed. The ids will be used for anchor links. So the above example would be turned into:

> "<h2 id="ExampleHeader">Example Header</h2><p>Here is some text about that
> header.</p><h2 id="AnotherHeader2">Another Header 2</h2></p>Well I got more 
> information here.</p>"

So every H2 in the string should go from:

<h2>Header Example Text Right Here</h2>

To:

<h2 id="HeaderExampleTextRightHere">Header Example Text Right Here</h2>

The id has the spaces removed but is otherwise the exact same text. How can I do that with regex?

Is there an HTML processing library available for C#? If so, please use that. Regex can handle your example HTML, but it will not be safe for more complex input.

Here is the regex/replace for your sample input. Remember, it is only meant for your sample input:

htmls = Regex.Replace(htmls, @"<h2>([^<]*)</h2>", m => "<h2 id=\"" + m.Groups[1].Value.Replace(" ", "") + "\">" + m.Groups[1].Value + "</h2>");

You can use:

Regex.Replace("<h2>XYZ</h2>", "<h2>(?<innerText>[^<]*)</h2>", x => string.Format("<h2 id=\"{0}\">{1}</h2>", x.Groups["innerText"].Value.Replace(" ", ""), x.Groups["innerText"].Value));
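
For reference, here is a minimal sketch of the same approach wrapped in a reusable helper, with the whitespace stripped from the id value as the question asks. The class name `HeaderIds` and method name `AddHeaderIds` are made up for illustration. It assumes every header is a plain `<h2>` with no attributes and no nested tags, so it is not a general HTML solution:

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeaderIds
{
    // Matches an attribute-free <h2> whose content contains no nested tags.
    private static readonly Regex H2Pattern =
        new Regex(@"<h2>(?<innerText>[^<]*)</h2>");

    // Hypothetical helper: rewrites each matched <h2> so that it carries an id
    // built from its own text with all whitespace removed.
    public static string AddHeaderIds(string html)
    {
        return H2Pattern.Replace(html, match =>
        {
            string text = match.Groups["innerText"].Value;
            string id = Regex.Replace(text, @"\s+", "");
            return $"<h2 id=\"{id}\">{text}</h2>";
        });
    }
}
```

Calling `HeaderIds.AddHeaderIds("<h2>Another Header 2</h2>")` returns `<h2 id="AnotherHeader2">Another Header 2</h2>`. Note that the sketch does not escape quotes in the header text, and two headers with the same text will end up with the same id.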
