Replace all < and > that are NOT part of an HTML tag

I have been trying to work out a regex that replaces every < and > in a string, EXCEPT when those characters are part of an HTML tag.

For example:

var str = "<p>The <b>value</b> <i>1</i> is < <u>2</u></p>"

Given the above example, I want the resulting string to look like this:

var str = "<p>The <b>value</b> <i>1</i> is &lt; <u>2</u></p>"

This is not easy. See the authoritative answer to a related question here.

Regular expressions are not built for this type of parsing. Even tokenizing or DOM parsing can cause problems. The title of your question illustrates the problem:

Replace all < and > that are NOT part of an HTML tag

How can your parser know whether < and > is an <AND> tag, or simply two orphan angle brackets around the word "and"?

An HTML parser is probably your best bet, but how it handles orphan brackets is key. You would also need to look for unmatched or illegal tags to catch cases such as the title of your question.
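To make the ambiguity concrete, here is a minimal sketch in Python (chosen only for brevity; .NET accepts the same pattern syntax). The pattern `<[^>]+>` is a deliberately naive, hypothetical "tag matcher" and does not come from the question. It reports `< and >` as a tag just as readily as a real one:

```python
import re

# Deliberately naive "tag" pattern: "<", one or more non-">" characters, then ">".
NAIVE_TAG = re.compile(r"<[^>]+>")

title = "Replace all < and > that are NOT part of an HTML tag"
markup = "<p>The <b>value</b></p>"

# The orphan brackets around "and" look exactly like a tag to this pattern.
false_tags = NAIVE_TAG.findall(title)
real_tags = NAIVE_TAG.findall(markup)

print(false_tags)
print(real_tags)
```

Nothing in the text itself distinguishes the two cases; only a list of known tag names or a parser's own rules can.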

HTML is notoriously difficult to parse using regular expressions. The HTML specifications are very forgiving, and browser implementations tend to be even more forgiving. As a result, matching something like this with regular expressions alone is almost impossible.

It's far more robust to use a full-blown HTML parser that understands all the special cases to generate a DOM, and then walk the resulting DOM in code looking for angle brackets in the text.

As you have tagged your question with .NET, I can recommend the HTML Agility Pack for this type of task.
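The parse-then-escape idea can be sketched as follows. This is not the HTML Agility Pack; it uses Python's standard-library `html.parser` as a stand-in so the sketch stays self-contained, and `BracketEscaper` and `escape_orphan_brackets` are hypothetical names made up for the illustration. It uses an event-based parser rather than building a DOM, but the principle is the same: whatever the parser recognizes as markup is written back, and brackets that reach the text handler are escaped.

```python
from html.parser import HTMLParser


class BracketEscaper(HTMLParser):
    """Re-emits recognized tags and escapes < and > that appear in text."""

    def __init__(self):
        # convert_charrefs=False leaves existing entities such as &amp; alone.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())  # original tag text, verbatim

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())  # self-closing, e.g. <br/>

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")  # note: tag name arrives lowercased

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_data(self, data):
        # Only text reaches this method, so any bracket here is an orphan.
        self.parts.append(data.replace("<", "&lt;").replace(">", "&gt;"))


def escape_orphan_brackets(html_text):
    parser = BracketEscaper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


result = escape_orphan_brackets("<p>The <b>value</b> <i>1</i> is < <u>2</u></p>")
print(result)
```

The sketch drops comments and doctype declarations for brevity, and the outcome depends entirely on what the chosen parser decides is a tag. This parser treats a `<` followed by a space as text, but `<and>` would still be read as a tag, so the ambiguity from the question title does not go away.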

There have been several questions asked about how to detect whether text is inside an HTML tag; you should be able to adapt the concept to your needs.

Basically, you're looking for a < that is not followed by a > before the next < appears, and you want to replace it with the entity &lt;. Try something like:

// requires: using System.Text.RegularExpressions;
var output = Regex.Replace(input, "<(?![^<>]*>)", "&lt;");
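The lookahead idea can be checked quickly. The sketch below is in Python rather than .NET purely for brevity, and the pattern `<(?![^<>]*>)` is an assumption of this sketch: its lookahead stops at the next angle bracket. It is compared with a lookahead that scans the rest of the string, `<(?!.*?>)`, which leaves the example untouched because some `>` always appears later.

```python
import re

# "<" that is NOT followed by ">" before the next angle bracket.
ORPHAN_LT = re.compile(r"<(?![^<>]*>)")
# Lookahead that scans the rest of the line for any ">" instead.
WHOLE_LINE_LOOKAHEAD = re.compile(r"<(?!.*?>)")

source = "<p>The <b>value</b> <i>1</i> is < <u>2</u></p>"

fixed = ORPHAN_LT.sub("&lt;", source)
unchanged = WHOLE_LINE_LOOKAHEAD.sub("&lt;", source)

# Limitation: "< and >" still looks like a tag, so it is left alone.
title = "Replace all < and > that are NOT part of an HTML tag"
title_after = ORPHAN_LT.sub("&lt;", title)

print(fixed)
print(unchanged == source)
print(title_after == title)
```

This only handles `<`. A matching rule for an orphan `>` would need a variable-length lookbehind, which .NET supports and Python's `re` module does not. Neither rule can tell `< and >` apart from a real tag.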
