
HTML Sanitization - bad markup?

I've been scanning some of the discussions on sanitizing HTML markup strings for redisplay on a page (e.g. blog comments). In the past I've just unilaterally escaped the markup for redisplay.

Does anyone know if there are any solutions out there that go beyond just removing "unsafe" tags?

What if the markup is invalid? For example, how do you prevent an unclosed <b> tag from bolding all the text that follows it on the page?

It seems like Stack Overflow handles this.

Example of unclosed 'b' tag

Thanks.

Stack Overflow uses either Textile or something very much like it.

Textile is more or less guaranteed to spit out valid (X)HTML, which avoids many of the typical problems with sanitizing user input.

The Html Agility Pack is probably a good starting point, as it claims to be very tolerant of badly formatted and malformed HTML. On top of that, you may want to build some rules to do further sanitization. In the end, you serialize the resulting DOM back to plain HTML.
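The parse, repair, and re-serialize idea can be sketched in a few lines. The example below is in Python with the standard library's `html.parser` rather than the Html Agility Pack, purely to keep it self-contained; `TagBalancer` and `balance` are hypothetical names chosen for illustration. It only balances tags (closing unclosed ones and dropping stray closers) and discards attributes, so it is not a sanitizer on its own.

```python
from html import escape
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img"}  # elements that never take a closing tag


class TagBalancer(HTMLParser):
    """Re-serialize markup, closing open tags and dropping stray closers."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # serialized output fragments
        self.open_tags = []  # stack of tags that still need a closer

    def handle_starttag(self, tag, attrs):
        # Attributes are dropped here for brevity.
        self.out.append(f"<{tag}>")
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        # Close everything opened after the matching tag, then the tag itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        # Text is re-escaped so it cannot introduce new markup.
        self.out.append(escape(data, quote=False))


def balance(markup):
    parser = TagBalancer()
    parser.feed(markup)
    parser.close()  # flushes any buffered trailing text
    while parser.open_tags:  # close whatever is still open at end of input
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)


print(balance("Example of <b>unclosed tag"))
```

Because every opened tag is tracked on a stack and closed at the end of input, an unclosed `<b>` in one comment cannot affect the markup that follows it on the page.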

I faced the same problem you did and built a rule-based HTML sanitizer on top of the Html Agility Pack. It lets you flatten or remove tags, transform tags (for example, replacing b with strong), and restrict attribute usage. Take a look at the source code of the HtmlRuleSanitizer for ideas, or just get the NuGet package if you want to be done quickly.
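To make the kinds of rules mentioned here concrete, this is a minimal Python sketch of a rule-based sanitizer. It is not the HtmlRuleSanitizer API; the rule tables (`RENAME`, `ALLOWED`, `REMOVE_WITH_CONTENT`, `ALLOWED_ATTRS`, `SAFE_PREFIXES`) and the `sanitize` function are assumptions made up for this example, and the chosen tags and URL prefixes are only one possible policy.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical rule tables for illustration.
RENAME = {"b": "strong", "i": "em"}        # transform one tag into another
ALLOWED = {"strong", "em", "p", "a"}       # whitelist, checked after renaming
REMOVE_WITH_CONTENT = {"script", "style"}  # drop the tag and everything in it
ALLOWED_ATTRS = {"a": {"href"}}            # per-tag attribute whitelist
SAFE_PREFIXES = ("http://", "https://", "mailto:")


class RuleSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []
        self.skip_depth = 0  # > 0 while inside a removed element

    def handle_starttag(self, tag, attrs):
        if tag in REMOVE_WITH_CONTENT:
            self.skip_depth += 1
            return
        tag = RENAME.get(tag, tag)
        if self.skip_depth or tag not in ALLOWED:
            return  # flatten: drop the tag itself, keep its children
        kept = []
        for name, value in attrs:
            value = (value or "").strip()
            if name not in ALLOWED_ATTRS.get(tag, ()):
                continue
            if name == "href" and not value.lower().startswith(SAFE_PREFIXES):
                continue  # rejects javascript: and other unexpected schemes
            kept.append(f' {name}="{escape(value)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in REMOVE_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        tag = RENAME.get(tag, tag)
        if self.skip_depth or tag not in self.open_tags:
            return
        while self.open_tags:  # also closes tags left open inside this one
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(markup):
    parser = RuleSanitizer()
    parser.feed(markup)
    parser.close()
    while parser.open_tags:  # close anything still open at end of input
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)


print(sanitize('<div><b onclick="x()">hi</b><script>alert(1)</script></div>'))
```

Here `div` is flattened because it is not on the whitelist, `b` is renamed to `strong`, the `onclick` attribute is stripped, and the `script` element is removed together with its content. Keeping the rules in plain tables makes the policy easy to review and adjust without touching the parsing code.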

Check this code:

Sanitize HTML. I think Stack Overflow uses it somewhere...

It is a method that strips any potentially dangerous tags from the provided raw HTML input using a whitelist-based approach, leaving only the "safe" HTML tags.

