
Javascript RegEx matching a string inside HTML tags

I've got a pretty weird situation here. I have a string that looks something like this:

<tag&nbsp;class="bla">hey&nbsp;there&nbsp;</tag>

I have to use JavaScript to replace all the &nbsp; entities contained inside the HTML tags with spaces. There can be any number of tags and &nbsp; entities. So it has to look like this:

<tag class="bla">hey&nbsp;there&nbsp;</tag>

Thanks in advance, Arthur.

Possibly not the most efficient, but should do the job:

str.replace(/<([^>]+)>/g, function(m){ return m.replace(/&nbsp;/gi, ' '); });

This should only touch the &nbsp; inside of <>.
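As a quick check, here is a minimal sketch that wraps the one-liner above in a function and runs it on the sample string from the question. The name fixTagNbsp is made up for illustration and is not part of the original answer.

```javascript
// Hypothetical wrapper around the one-liner above.
function fixTagNbsp(str) {
    // Match each <...> chunk, then replace every &nbsp; inside that chunk only.
    return str.replace(/<([^>]+)>/g, function (m) {
        return m.replace(/&nbsp;/gi, ' ');
    });
}

const input = '<tag&nbsp;class="bla">hey&nbsp;there&nbsp;</tag>';
console.log(fixTagNbsp(input));
// <tag class="bla">hey&nbsp;there&nbsp;</tag>
```

The &nbsp; entities in the text between the tags are left alone because they never fall inside a <...> match.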

First off, let's state again that when parsing (X)HTML with regex is the right answer, it's probably because the question is seriously messed up. In this case you should get the guy who generated the corrupted HTML, and make him put his nose in it, then make him fix the mess.

Otherwise, among other things, it will become your work, and you'll accept responsibility for any further mess.

That said, maybe the safest approach would be to look for

<([^<>]*)&nbsp;([^<>]*)>

and replace it with <\1 \2> (in a JavaScript replacement string the backreferences are written $1 and $2, i.e. '<$1 $2>'). The downside of this approach is that you will have to do this repeatedly: each pass removes only one &nbsp; per tag, so if you have a tag with eight &nbsp;'s inside, you'll have to iterate the replacement eight times.

So you'll also need a loop that performs the replace, and if the replaced text is identical to what it was before, then you're done and may exit the loop.
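The loop described above could be sketched as follows. The function name replaceNbspInTags is an assumption made for this example.

```javascript
// Hypothetical helper: repeats the replacement until the string stops changing.
function replaceNbspInTags(str) {
    // One &nbsp; per tag is replaced on each pass (the greedy first group
    // means the last &nbsp; in a tag is handled first).
    const pattern = /<([^<>]*)&nbsp;([^<>]*)>/g;
    let previous;
    do {
        previous = str;
        // JavaScript backreferences in the replacement string are $1 and $2.
        str = str.replace(pattern, '<$1 $2>');
    } while (str !== previous); // unchanged text means nothing is left to fix
    return str;
}

console.log(replaceNbspInTags('<tag&nbsp;class="bla">hey&nbsp;there&nbsp;</tag>'));
// <tag class="bla">hey&nbsp;there&nbsp;</tag>
```

A tag with n entities needs n productive passes plus one final pass that detects no change.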

This is not the most efficient way in terms of replacement speed, but it's more straightforward and simpler to handle. It also helps in remembering that this is a kludgy fix :-)

The problem described in RoToRa's comment may be fixed in this particular case by modifying the outer expression:

<(\w[^<>]*)&nbsp;([^<>]*)>

so that it only accepts tags starting with a word character (\w matches letters, digits and underscores). 1 < 2 &nbsp; > 3 would then be rejected.
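To illustrate, this small sketch compares the two expressions on the inequality example. The letterOnly variant is an addition for this example, not something from the original answer: it shows how one might require a real letter, since \w also accepts a digit directly after the <.

```javascript
const loose = /<([^<>]*)&nbsp;([^<>]*)>/;
const strict = /<(\w[^<>]*)&nbsp;([^<>]*)>/;

const text = '1 < 2 &nbsp; > 3';
console.log(loose.test(text));  // true: the inequality is mistaken for a tag
console.log(strict.test(text)); // false: a space follows the <, not a word character

// Assumption for illustration: a stricter variant that requires a letter.
const letterOnly = /<([a-zA-Z][^<>]*)&nbsp;([^<>]*)>/;
const digitCase = '1 <2&nbsp;> 3';
console.log(strict.test(digitCase));     // true: \w accepts the digit 2
console.log(letterOnly.test(digitCase)); // false
```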

The same "fix" applies to Ross McLellan's solution:

str.replace(/<(\w[^>]+)>/g, function(m){ return m.replace(/&nbsp;/gi, ' '); });

Performance-wise, Ross's solution is faster on small HTML chunks and falls behind mine as the number of tags grows. That's because the search overhead is marginally larger for my solution, but then mine finds far fewer matches, so fewer calls to replace() are actually made.

This modification might get the best of both worlds, but I haven't tested it:

str.replace(/<(\w[^<>]*&nbsp;[^<>]*)>/g,
    function(m) {
        return m.replace(/&nbsp;/gi, ' ');
    }
);
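A rough way to try this out is to count how often the callback runs for each tag pattern. This is only a sketch on one hand-made input, not a benchmark, and the helper countingReplace is hypothetical.

```javascript
// Hypothetical helper: runs the callback-based replace and counts callback calls.
function countingReplace(str, tagPattern) {
    let calls = 0;
    const result = str.replace(tagPattern, function (m) {
        calls += 1;
        return m.replace(/&nbsp;/gi, ' ');
    });
    return { result, calls };
}

const html = '<div><span&nbsp;class="x">hi&nbsp;there</span></div>';
const ross = countingReplace(html, /<(\w[^>]+)>/g);
const combined = countingReplace(html, /<(\w[^<>]*&nbsp;[^<>]*)>/g);

console.log(combined.result);                 // <div><span class="x">hi&nbsp;there</span></div>
console.log(ross.result === combined.result); // true
console.log(ross.calls, combined.calls);      // 2 1
```

On this input the combined pattern invokes the callback only for the one tag that actually contains an &nbsp;, while the other pattern also fires for <div>.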
