
How can I remove certain characters from inside angle-brackets, leaving the characters outside alone?

Edit: To be clear, please understand that I am not using regex to parse the HTML; that's crazy talk! I simply want to clean up a messy string of HTML so that it will parse.

Edit #2: I should also point out that the control character I'm using is a special Unicode character; it is not something that would ever appear in a proper tag under normal circumstances.

Suppose I have a string of HTML that contains a bunch of control characters, and I want to remove the control characters from inside tags only, leaving the characters outside the tags alone.

For example

Here the control character is the numeral "1".

Input

The quick 1<strong>orange</strong> lemming <sp11a1n 1class1='jumpe111r'11>jumps over</span> 1the idle 1frog

Desired Output

The quick 1<strong>orange</strong> lemming <span class='jumper'>jumps over</span> 1the idle 1frog

So far I can match tags which contain the control character, but I can't remove the characters in one regex. I guess I could run another regex on my matches, but I'd really like to know if there's a better way.

My regex

Bear in mind this one only matches tags which contain the control character.

<(([^>])*?`([^>])*?)*?>

Thanks very much for your time and consideration.

Iain Fraser

Regex isn't the tool for this, but you can use lookbehind and lookahead to match a 1 inside a tag. Here it is in Java, modified to use a bounded lookbehind (since Java doesn't support unbounded-length lookbehind).

    String s = "123 <o123o></o1o1> <oo 11='11x'> x11 <msg136='I <3 Johnny!11'>";
    System.out.println(
        s.replaceAll("(?<=<[^<>]{0,999})(?=[^<>]+>)1", "")
    ); // prints "123 <o23o></oo> <oo ='x'> x11 <msg136='I <3 Johnny!'>"

There are many cases where this will fail, but it should get you started somewhere.
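As one illustration of such a failure, here is a minimal sketch (the class and method names are hypothetical, not part of the original answer). It applies the same replacement to text in which `<` and `>` are comparison operators rather than tag delimiters; the pattern cannot tell the difference, so a 1 that merely sits between them is removed.

```java
public class LookaroundPitfall {
    // Same replacement as in the snippet above, wrapped in a hypothetical helper.
    static String strip(String s) {
        return s.replaceAll("(?<=<[^<>]{0,999})(?=[^<>]+>)1", "");
    }

    public static void main(String[] args) {
        // The '<' and '>' here are not tag delimiters, yet the first 1
        // lies between them and is removed; the trailing 1 is kept.
        System.out.println(strip("a < 1 and b > 1")); // prints "a <  and b > 1"
    }
}
```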


Okay, I've "generalized" the problem so that it's not HTML related. Here's a snippet of Java that uses regex to remove [aeiou] from portions of a sentence enclosed by < and >, characters that are reserved solely for marking these special portions.

BEWARE: this regex is absolutely unreadable. But yes, it works. And it uses no lookbehind, either.

String s = "Wait <whaaat?> does this <really really> work???";
System.out.println(
    s.replaceAll("(?!>)(?:(?=<)|(?=\\G)(?!^))(?:(?:(?![aeiou])(.))|.)", "$1")
); // prints "Wait <wht?> does this <rlly rlly> work???"
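For readers who want to see how the pattern is put together, the following sketch spells out the same regex in Java's `Pattern.COMMENTS` mode, where whitespace is ignored and `#` starts a comment. The class name `VowelStripper` and the method `strip` are hypothetical; the annotations are one reading of the pattern, not the answerer's own explanation.

```java
import java.util.regex.Pattern;

public class VowelStripper {
    // The same pattern as above, split into commented pieces.
    static final Pattern INSIDE = Pattern.compile(
          "(?!>)               # never consume the closing bracket, which ends a chain\n"
        + "(?: (?=<)           # start a chain at an opening bracket\n"
        + "  | (?=\\G)(?!^) )  # or continue exactly where the previous match ended\n"
        + "(?: (?![aeiou])(.)  # a non-vowel is captured in group 1 and kept\n"
        + "  | . )             # a vowel is matched without capture and dropped\n",
        Pattern.COMMENTS);

    static String strip(String s) {
        // An unset group 1 contributes nothing, so vowels inside brackets vanish.
        return INSIDE.matcher(s).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(strip("Wait <whaaat?> does this <really really> work???"));
        // prints "Wait <wht?> does this <rlly rlly> work???"
    }
}
```

Each match consumes exactly one character, and `\G` ties every match to the end of the previous one, so the matches form a chain that starts at `<` and stops just before `>`.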

I might try to explain it if there's interest, but otherwise I'd suggest using a simple loop like this instead:

allocate output buffer
set isInside := false
for every character ch in input
   if (ch is openChar)
      isInside := true
   else if (ch is closeChar)
      isInside := false
   // openChar and closeChar are never control characters, so they are always kept
   if not (isInside and ch is control)
      append ch to buffer
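The loop above could be sketched in Java roughly as follows. The class name `TagCleaner`, the method `stripInside`, and its parameters are hypothetical names chosen for illustration, and the sketch assumes the control character differs from both delimiters, so the delimiters themselves are always copied to the output.

```java
public class TagCleaner {
    // Removes every `control` character found between `open` and `close`,
    // leaving all text outside the delimiters untouched.
    static String stripInside(String input, char open, char close, char control) {
        StringBuilder out = new StringBuilder(input.length());
        boolean inside = false;
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            if (ch == open) {
                inside = true;
            } else if (ch == close) {
                inside = false;
            }
            // Drop the character only if it is a control character inside a tag.
            if (!(inside && ch == control)) {
                out.append(ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "The quick 1<strong>orange</strong> lemming "
                 + "<sp11a1n 1class1='jumpe111r'11>jumps over</span> 1the idle 1frog";
        System.out.println(stripInside(s, '<', '>', '1'));
        // prints "The quick 1<strong>orange</strong> lemming <span class='jumper'>jumps over</span> 1the idle 1frog"
    }
}
```

A single pass with one boolean flag is enough here because the delimiters are assumed not to nest.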

You shouldn't generally use regex to parse HTML, but this is not HTML to begin with, and hence you can't use a parser. The following seems to work.

var s = "The quick 1<strong>orange</strong> lemming <sp11a1n 1class1='jumpe111r'11>jumps over</span> 1the idle 1frog";
while(s.match(/<[^>]*?1(?=[^>]*>)/))
  s = s.replace(/(<[^>]*?)1(?=[^>]*>)/g, "$1");
console.log(s); //"The quick 1<strong>orange</strong> lemming <span class='jumper'>jumps over</span> 1the idle 1frog"

I get that you're not "parsing" it as such. You do, however, need to work out what is an HTML tag and what isn't; this requires parsing, and a regex alone will not manage it.

Maybe the solution to the control characters in tag names is to globally replace all the control characters with a valid text pattern.

Then you can parse the resulting XML/HTML with an XML/HTML document parser. You can then walk the parsed document to perform your search-and-replace on tag names, attribute names, and values.
