Checking an HTML string for unopened tags

I have HTML source held in a string, and I want to check whether it contains a closing tag that was never opened.

For example, the string below contains </u> after WAVEFORM, with no opening <u> before it.

WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,

I just want to check for this kind of unopened tag and then prepend the matching opening tag to the start of the string.

For this specific case you can use HTML Agility Pack to check whether the HTML is well formed or contains tags that were never opened.

using System;
using HtmlAgilityPack;

var htmlDoc = new HtmlDocument();

// Problems found while parsing are collected in ParseErrors.
htmlDoc.LoadHtml(
    "WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,");

foreach (var error in htmlDoc.ParseErrors)
{
    // Prints: TagNotOpened
    Console.WriteLine(error.Code);
    // Prints: Start tag <u> was not found
    Console.WriteLine(error.Reason);
}

Not so easy. You can't directly use an HTML parser, because the input is not valid HTML, and you can't easily throw a regex at the whole thing either, because regexes can't cope with nesting or other HTML complications.

Probably the best you could do is to use a regex to find each markup structure, e.g. something like:

<(\w+)(?:\s+[-\w]+(?:\s*(?:=\s*(?:"[^"]*"|'[^']*'|[^'">\s][^>\s]*)))?)*\s*>
|</(\w+)\s*>
|<!--.*?-->
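As a rough illustration of how that pattern can be applied, here is a minimal Python sketch. The names MARKUP and tokens are hypothetical, and lowercasing the tag names is an added assumption (HTML tag names are case-insensitive). The pattern is compiled in verbose mode so that the line breaks between the three alternatives are ignored, and with DOTALL so that a comment may span several lines.

```python
import re

# The pattern from the answer. re.VERBOSE ignores the line breaks between the
# alternatives; re.DOTALL lets ".*?" in the comment branch cross newlines.
MARKUP = re.compile(r"""
    <(\w+)(?:\s+[-\w]+(?:\s*(?:=\s*(?:"[^"]*"|'[^']*'|[^'">\s][^>\s]*)))?)*\s*>
    |</(\w+)\s*>
    |<!--.*?-->
""", re.VERBOSE | re.DOTALL)


def tokens(html):
    """Yield (kind, name) pairs; kind is 'start', 'end' or 'comment'."""
    for m in MARKUP.finditer(html):
        if m.group(1) is not None:      # group 1 is set only for start tags
            yield ("start", m.group(1).lower())
        elif m.group(2) is not None:    # group 2 is set only for end tags
            yield ("end", m.group(2).lower())
        else:                           # neither group set: a comment
            yield ("comment", None)


sample = "WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,"
print(list(tokens(sample)))
# [('end', 'u'), ('start', 'u'), ('end', 'u')]
```

Note that the pattern as written does not match XHTML-style self-closing tags such as <br/>, so this sketch silently skips them.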

Start with an empty tags-to-open list and an empty tags-to-close list. For each match in the string, look at groups 1 and 2 to see whether you've got a start tag or an end tag. (Or a comment, which you can ignore.)

If you've got a start tag, you need to know whether it needs closing, i.e. whether or not it's one of the EMPTY content-model tags like <img>. If an element is EMPTY, it doesn't need closing, so you can ignore it. (If you have XHTML, this is all a bit easier.)

If you have a start tag, add the tag name in the regex group to the tags-to-close list. If you've got an end tag, take one tag off the end of the tags-to-close list (it should be the same tag name as the one on there; otherwise you've got invalid markup). If there are no tags on the tags-to-close list, instead add the tag name to the tags-to-open list.

Once you've got to the end of the input string, prepend each of the tags-to-open tags to the string in reverse order, and append the close tags for the tags-to-close to the end, again in reverse order.
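The steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the names MARKUP, VOID and balance are hypothetical, the VOID set is an assumed list of EMPTY elements taken from the HTML void elements, tag names are lowercased, and a mismatched end tag is simply popped rather than reported as invalid markup.

```python
import re

# The pattern from the answer, compiled so the line breaks are ignored.
MARKUP = re.compile(r"""
    <(\w+)(?:\s+[-\w]+(?:\s*(?:=\s*(?:"[^"]*"|'[^']*'|[^'">\s][^>\s]*)))?)*\s*>
    |</(\w+)\s*>
    |<!--.*?-->
""", re.VERBOSE | re.DOTALL)

# Assumed set of EMPTY elements, which never take an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


def balance(html):
    """Return html with missing start tags prepended and missing end tags appended."""
    to_open = []    # end tags seen with no start tag before them
    to_close = []   # start tags still waiting for their end tag
    for m in MARKUP.finditer(html):
        start, end = m.group(1), m.group(2)
        if start is not None:
            if start.lower() not in VOID:
                to_close.append(start.lower())
        elif end is not None:
            if to_close:
                # Assumed to match the last open tag; a different name
                # would mean the markup is invalid.
                to_close.pop()
            else:
                to_open.append(end.lower())
        # A comment sets neither group and is ignored.
    prefix = "".join(f"<{name}>" for name in reversed(to_open))
    suffix = "".join(f"</{name}>" for name in reversed(to_close))
    return prefix + html + suffix


sample = "WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,"
print(balance(sample))
# <u>WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,
```

Reversing both lists is what keeps the nesting right: for the fragment "x</i></b>" the tags-to-open list is [i, b], and the prefix becomes <b><i>, so the first end tag encountered closes the innermost added element.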

(Yeah, I'm parsing HTML with regex. I think the nastiness of this demonstrates why you don't want to. If there's anything you can do to avoid having your markup snipped in the middle of a tag in the first place, do that.)
