Check for unopened tags in an HTML string

Question

I have an HTML source held in a string, and I want to check whether it contains a closing tag that was never opened.

For example, the string below contains </u> after WAVEFORM, which has no opening <u>.

WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,

I just want to detect unopened tags of this kind and then prepend the missing opening tag to the start of the string.

Answer 1

For this specific case you can use HTML Agility Pack to check whether the HTML is well formed or contains tags that were never opened.

using System;
using HtmlAgilityPack;

var htmlDoc = new HtmlDocument();

// Parsing does not throw on malformed markup; problems are collected in ParseErrors.
htmlDoc.LoadHtml(
    "WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,");

foreach (var error in htmlDoc.ParseErrors)
{
    // Prints: TagNotOpened
    Console.WriteLine(error.Code);
    // Prints: Start tag <u> was not found
    Console.WriteLine(error.Reason);
}

Answer 2

Not so easy. You can't directly use an HTML parser because the input isn't valid HTML, and you can't easily throw a regex at the whole thing because regexes can't cope with nesting or other HTML complications.

Probably the best you could do is use a regex to find each markup structure, e.g. something like:

<(\w+)(?:\s+[-\w]+(?:\s*(?:=\s*(?:"[^"]*"|'[^']*'|[^'">\s][^>\s]*)))?)*\s*>
|</(\w+)\s*>
|<!--.*?-->

Start with an empty tags-to-open list and an empty tags-to-close list. For each match in the string, look at groups 1 and 2 to see whether you've got a start tag or an end tag. (Or a comment, which you can ignore.)

If you've got a start tag, you need to know whether it needs closing, i.e. whether it's one of the EMPTY content-model tags like <img>. If an element is EMPTY, it doesn't need closing, so you can ignore it. (If you have XHTML, this is all a bit easier.)

Otherwise, if you have a start tag, add the tag name from the regex group to the tags-to-close list. If you've got an end tag, take one tag off the end of the tags-to-close list (it should be the same tag name as the end tag, otherwise you've got invalid markup). If there are no tags on the tags-to-close list, add the tag name to the tags-to-open list instead.

Once you've got to the end of the input string, prepend each of the tags-to-open tags to the string in reverse order, and append the close tags for the tags-to-close list to the end, again in reverse order.
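
The steps above could be sketched in C# roughly as follows. This is only an illustration under a few assumptions: the names TagBalancer, Balance and EmptyElements are made up for this example, the EMPTY set is taken to be the HTML 4.01 EMPTY elements, and an end tag whose name differs from the last open tag is simply popped rather than reported as invalid markup.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; the class and member names are not from the answer.
public static class TagBalancer
{
    // The pattern from the answer, joined into one expression.
    // Group 1 = start-tag name, group 2 = end-tag name, third alternative = comment.
    private static readonly Regex Markup = new Regex(
        @"<(\w+)(?:\s+[-\w]+(?:\s*(?:=\s*(?:""[^""]*""|'[^']*'|[^'"">\s][^>\s]*)))?)*\s*>" +
        @"|</(\w+)\s*>" +
        @"|<!--.*?-->",
        RegexOptions.Singleline);

    // Assumption: elements with an EMPTY content model in HTML 4.01.
    private static readonly HashSet<string> EmptyElements =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "area", "base", "basefont", "br", "col", "frame", "hr",
            "img", "input", "isindex", "link", "meta", "param"
        };

    public static string Balance(string html)
    {
        var toOpen = new List<string>();   // end tags seen with nothing open
        var toClose = new List<string>();  // start tags still waiting for an end tag

        foreach (Match m in Markup.Matches(html))
        {
            if (m.Groups[1].Success)
            {
                // Start tag: remember it unless the element never needs closing.
                string name = m.Groups[1].Value;
                if (!EmptyElements.Contains(name))
                    toClose.Add(name);
            }
            else if (m.Groups[2].Success)
            {
                if (toClose.Count > 0)
                    toClose.RemoveAt(toClose.Count - 1); // assumes the names match
                else
                    toOpen.Add(m.Groups[2].Value);
            }
            // Neither group matched: a comment, which is ignored.
        }

        // Both lists are emitted in reverse order so the added tags nest correctly.
        string prefix = string.Concat(Enumerable.Reverse(toOpen).Select(t => "<" + t + ">"));
        string suffix = string.Concat(Enumerable.Reverse(toClose).Select(t => "</" + t + ">"));
        return prefix + html + suffix;
    }
}
```

Calling TagBalancer.Balance on the string from the question should put a single <u> in front of WAVEFORM and leave the rest unchanged, since the later <u>...</u> pair is already balanced.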

(Yeah, I'm parsing HTML with regex. I think the nastiness of this demonstrates why you don't want to. If there's anything you can do to avoid having already snipped your markup in the middle of an element, do that.)
