
C# Regex to parse HTML string and add ids into each header tag?

I've got a CMS and I need to auto-format the HTML strings before they are served to the client. In the database I may have an HTML string like this:

> "<h2>Example Header</h2><p>Here is some text about that
> header.</p><h2>Another Header 2</h2><p>Well I got more information
> here.</p>"

I want to add an id attribute to every h2 tag, whose value is the text inside that tag with the spaces removed; the ids will be used for anchor links. So the above example would be turned into:

> "<h2 id="ExampleHeader">Example Header</h2><p>Here is some text about that
> header.</p><h2 id="AnotherHeader2">Another Header 2</h2><p>Well I got more
> information here.</p>"

So for every H2 in the string go from:

<h2>Header Example Text Right Here</h2>

To:

<h2 id="HeaderExampleTextRightHere">Header Example Text Right Here</h2>

Spaces removed but otherwise the exact same text. How can I do that with regex?

If an HTML processing library is available to you in C#, go with that. A regex can handle your example HTML, but it is not safe for more complex markup, such as an h2 tag that already has attributes or that contains nested tags.
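As an illustration of the library route, here is a minimal sketch using Html Agility Pack. That choice is an assumption: the answer does not name a library, and this one is a third-party NuGet package rather than part of .NET. The class and method names (`HeaderIdsWithParser`, `AddIds`) are made up for the example.

```csharp
using HtmlAgilityPack; // third-party NuGet package (assumption: not named in the answer)

public static class HeaderIdsWithParser
{
    // Hypothetical helper: sets id on every <h2> to its text with spaces removed.
    public static string AddIds(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var headers = doc.DocumentNode.SelectNodes("//h2");
        if (headers != null)
        {
            foreach (HtmlNode h2 in headers)
            {
                // InnerText ignores nested tags; only literal spaces are removed.
                string id = h2.InnerText.Replace(" ", "");
                h2.SetAttributeValue("id", id);
            }
        }

        // Serialize the modified document back to a string.
        return doc.DocumentNode.OuterHtml;
    }
}
```

Unlike the regex below, this also finds h2 tags that already carry attributes or contain inline markup. A parser may normalize malformed input, so the output is not guaranteed to be byte-for-byte identical outside the h2 tags.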

Here is the regex/replace for your sample input. Remember, only for your sample input:

htmls = Regex.Replace(htmls, @"<h2>([^<]*)</h2>", m => "<h2 id=\"" + m.Groups[1].Value.Replace(" ", "") + "\">" + m.Groups[1].Value + "</h2>");

You can use:

Regex.Replace("<h2>XYZ</h2>", "<h2>(?<innerText>[^<]*)</h2>", x => string.Format("<h2 id=\"{0}\">{1}</h2>", x.Groups["innerText"].Value.Replace(" ", ""), x.Groups["innerText"].Value))
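The question asks for the id to have its spaces removed, so the replacement has to be computed per match rather than written as a fixed replacement string. The following is a minimal sketch of that, under the same assumptions as the patterns above: the h2 tags have no attributes and contain no nested tags. `HeaderIds` and `AddIds` are hypothetical names chosen for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeaderIds
{
    // Matches an attribute-free <h2> whose content contains no nested tags.
    private static readonly Regex H2 = new Regex("<h2>(?<innerText>[^<]*)</h2>");

    // Hypothetical helper: adds id="TextWithoutSpaces" to every matching <h2>.
    public static string AddIds(string html)
    {
        return H2.Replace(html, m =>
        {
            string text = m.Groups["innerText"].Value;
            string id = text.Replace(" ", ""); // only literal spaces are removed
            return "<h2 id=\"" + id + "\">" + text + "</h2>";
        });
    }
}
```

Calling `HeaderIds.AddIds(html)` on the string from the question should leave the visible header text untouched and add only the id attribute. Headers that do not match the pattern are left as they are, and two headers with the same text would get the same id.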
