Regex to escape angle brackets that are not HTML tags

Question

I have HTML-based text (with HTML tags). I want to find words that occur within angle brackets and replace those brackets with &lt; and &gt;, and do the same when angle brackets are used as math symbols.

For example:

String text = "Hello, <b> Whatever <br /> <table> <tr> <td width=\"300px\"> "
            + "1 < 2 This is a <test> </td> </tr> </table>";

I want this to become:

Hello,  <b> Whatever <br /> <table>  <tr> <td width="300px"> 
1 &lt; 2 This is a &lt; test &gt; </td> </tr> </table>

Thanks in advance.

Answer 1

I would suggest using HtmlCleaner.

If you look at its home page, the example there shows exactly how text is escaped.

<td><a href=index.html>1 -> Home Page</a>

is converted to

<td>
   <a href="index.html">1 -&gt; Home Page</a>
</td>

It will normalize your HTML to conform to standard XHTML. I have used it in the past and (IMHO) it is pretty solid and more reliable than JTidy & Co. (And of course it is better than using regex or replace strategies.)

Answer 2

Please see "RegEx match open tags except XHTML self-contained tags" and don't use regex to parse HTML. Use an SGML parser instead. Regex would fail too often, because HTML isn't a regular language.

Answer 3

If it were not for CSS, JavaScript, and CDATA sections, it would be possible.

If you are only dealing with a subset of HTML, you could make the assumption that angle brackets not adjacent to valid element identifier characters can be encoded.

Something like "<(?=[^A-Za-z_:0-9/])" -> "&lt;" and "(?<=[^A-Za-z_:0-9/])>" -> "&gt;"

But, unless you are generating the HTML yourself and KNOW that it has no embedded CSS, JavaScript, CDATA, or object sections...

As fraido said, don't use regular expressions for non-regular languages.
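
As a rough illustration of the two substitutions suggested in this answer, here is a minimal Ruby sketch using the built-in Regexp class (Ruby 1.9 or later, which supports lookbehind). The constant names and the escape_loose_brackets method are made up for this example, and it only behaves sensibly under the subset-of-HTML assumption described above.

```ruby
# Assumption: the input is a restricted subset of HTML with no script,
# style, or CDATA sections. All names below are hypothetical.

# "<" followed by a character that cannot start a tag name or a closing tag.
LT_NOT_TAG = %r{<(?=[^A-Za-z_:0-9/])}

# ">" preceded by a character that cannot end a tag name or a self-closing tag.
GT_NOT_TAG = %r{(?<=[^A-Za-z_:0-9/])>}

# Returns a copy of html with the "loose" brackets encoded as entities.
def escape_loose_brackets(html)
  html.gsub(LT_NOT_TAG, '&lt;').gsub(GT_NOT_TAG, '&gt;')
end

puts escape_loose_brackets('<b>1 < 2</b>')
puts escape_loose_brackets('a -> b')
```

Be aware that the lookbehind also fires on a tag that ends in a quoted attribute: '<td width="300px">' comes back as '<td width="300px"&gt;', which is one reason this only suits a restricted subset of HTML.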

Answer 4

As everyone says, you shouldn't rely on regular expressions to parse HTML. They simply can't do it. But, in my case, I wanted to capture any angle brackets that didn't look like they were in an HTML tag, and escape them. Since everything was going through a sanitizer afterwards, security wasn't a concern, and the results just needed to be good enough to catch most situations, not all.

You need a regexp library that supports zero-width lookahead assertions. In my case, that was Oniguruma in Ruby 1.8.

To match the less-than symbols (<), I used:

<(?!(/?[A-Za-z_:0-9]+\s?/?>))

Matching the greater-than (>) symbols is harder. Most libraries don't support zero-width lookbehind assertions of variable length. So you cheat: reverse the string, run a lookahead assertion, and reverse it back afterwards, using the following pattern:

>(?!(/?\s?[A-Za-z_:0-9]+/?<))

So, my code looks a bit like:

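# Escape every "<" that does not start something that looks like a simple tag.
match_less_than = Oniguruma::ORegexp.new('<(?!(/?[A-Za-z_:0-9]+\s?/?>))')
match_less_than.gsub!(string, '&lt;')

# No variable-length lookbehind is available, so reverse the string, match ">"
# with a lookahead (the replacement is reversed too), then reverse the result back.
match_greater_than = Oniguruma::ORegexp.new('>(?!(/?\s?[A-Za-z_:0-9]+/?<))')
string = match_greater_than.gsub(string.reverse, '&gt;'.reverse).reverse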

Nasty, huh?
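
For readers on Ruby 1.9 or later, where the built-in Regexp class handles these lookaheads, the same reverse-and-lookahead trick can be sketched without the Oniguruma gem. This is an adaptation rather than the code I actually ran: the constant and method names are hypothetical, and the redundant capture group has been dropped from both patterns.

```ruby
# Adaptation of the approach above using only Ruby's built-in Regexp.
# All names here are hypothetical.

# "<" that is NOT followed by: optional "/", a tag name, optional space,
# optional "/", and ">".
LESS_THAN = %r{<(?!/?[A-Za-z_:0-9]+\s?/?>)}

# Same idea for ">", written for the REVERSED string, so the tag is read
# right to left.
GREATER_THAN_REVERSED = %r{>(?!/?\s?[A-Za-z_:0-9]+/?<)}

def escape_non_tag_brackets(text)
  escaped = text.gsub(LESS_THAN, '&lt;')
  # Reverse, replace with the reversed entity, then reverse back.
  escaped.reverse.gsub(GREATER_THAN_REVERSED, '&gt;'.reverse).reverse
end

puts escape_non_tag_brackets('<b>1 < 2 -> 3</b> <br />')
```

The patterns only recognise bare tags such as <b>, </td>, or <br />. A tag carrying attributes, for example <td width="300px">, does not match and gets escaped on both sides, which is part of the "good enough for most situations" trade-off.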

4 answers

Answer 1
3 2010-03-22 15:40:43

Answer 2
1 2010-03-22 15:43:05

Answer 3
0 2010-03-22 16:04:56

Answer 4
0 2010-11-01 11:14:33
