How can I remove certain characters from inside angle-brackets, leaving the characters outside alone?

Question

Edit: To be clear, I am not using regex to parse the HTML; that's crazy talk! I simply want to clean up a messy string of HTML so that it will parse.

Edit #2: I should also point out that the control character I'm using is a special Unicode character; it would never appear in a proper tag under normal circumstances.

Suppose I have a string of HTML that contains a bunch of control characters, and I want to remove the control characters from inside tags only, leaving the characters outside the tags alone.

For example

Here the control character is the numeral "1".

Input

The quick 1<strong>orange</strong> lemming <sp11a1n 1class1='jumpe111r'11>jumps over</span> 1the idle 1frog

Desired Output

The quick 1<strong>orange</strong> lemming <span class='jumper'>jumps over</span> 1the idle 1frog

So far I can match tags that contain the control character, but I can't remove the control characters in a single regex. I guess I could run a second regex on my matches, but I'd really like to know if there's a better way.

My regex

Bear in mind this one only matches tags that contain the control character (written as a backtick in this pattern).

<(([^>])*?`([^>])*?)*?>

Thanks very much for your time and consideration.

Iain Fraser

Answer 1

Regex isn't the tool for this, but you can use lookbehind and lookahead to match a "1" inside a tag. Here it is in Java, with the lookbehind bounded to at most 999 characters, since Java doesn't support unbounded lookbehind.

    String s = "123 <o123o></o1o1> <oo 11='11x'> x11 <msg136='I <3 Johnny!11'>";
    System.out.println(
        s.replaceAll("(?<=<[^<>]{0,999})(?=[^<>]+>)1", "")
    ); // prints "123 <o23o></oo> <oo ='x'> x11 <msg136='I <3 Johnny!'>"

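There are many cases where this will fail (for example, when an attribute value contains an angle bracket, as in the last tag above, where the "1" in msg136 survives), but it should get you started.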
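For comparison, the same lookaround idea can be sketched in JavaScript. This assumes an engine with ES2018 lookbehind support (recent Node.js or Chrome, for instance), where the lookbehind may be unbounded, so the {0,999} workaround is not needed. The helper name stripInsideTagsLookaround and its ctrl parameter are made up for illustration; the same caveats about failure cases apply.

```javascript
// Sketch only: assumes ES2018 lookbehind support (e.g. recent Node.js / V8).
// stripInsideTagsLookaround is a hypothetical helper, not a library function.
function stripInsideTagsLookaround(s, ctrl) {
  // Escape the control character in case it is a regex metacharacter.
  const c = ctrl.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Match ctrl only when a "<" precedes it and a ">" follows it,
  // with no other angle bracket in between on either side.
  const re = new RegExp("(?<=<[^<>]*)" + c + "(?=[^<>]*>)", "g");
  return s.replace(re, "");
}

const sample = "123 <o123o></o1o1> <oo 11='11x'> x11 <msg136='I <3 Johnny!11'>";
console.log(stripInsideTagsLookaround(sample, "1"));
```

On the sample string this should give the same result as the Java version above.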

Answer 2

    var s = "The quick 1<strong>orange</strong> lemming <sp11a1n 1class1='jumpe111r'11>jumps over</span> 1the idle 1frog";
    // Each pass removes at most one "1" per tag, so loop until no tag contains one.
    while (/<[^>]*?1(?=[^>]*>)/.test(s))
      s = s.replace(/(<[^>]*?)1(?=[^>]*>)/g, "$1");
    console.log(s); // "The quick 1<strong>orange</strong> lemming <span class='jumper'>jumps over</span> 1the idle 1frog"
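The loop above can probably be avoided by doing what the question hints at: match each tag once, then clean the matched text in a replacement callback. This is a minimal sketch, not a definitive implementation; stripInsideTags is a hypothetical helper name, and it assumes a tag is anything from "<" to the next ">" with no other angle bracket in between.

```javascript
// Sketch only: stripInsideTags is a hypothetical helper name.
function stripInsideTags(s, ctrl) {
  // Outer regex finds each tag; the callback removes every occurrence
  // of ctrl from that tag only. split/join avoids regex-escaping ctrl.
  return s.replace(/<[^<>]*>/g, (tag) => tag.split(ctrl).join(""));
}

const input = "The quick 1<strong>orange</strong> lemming <sp11a1n 1class1='jumpe111r'11>jumps over</span> 1the idle 1frog";
console.log(stripInsideTags(input, "1"));
// "The quick 1<strong>orange</strong> lemming <span class='jumper'>jumps over</span> 1the idle 1frog"
```

Because the callback sees the whole tag, a single pass removes every control character in it, and text outside the tags is never touched.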

Answer 3

I get that you're not "parsing" it as such. However, you do need to work out what is an HTML tag and what isn't; that requires parsing, and a regex alone will not manage it.

Maybe the solution to the control characters in tag names is to globally replace all the control characters with a valid text pattern.

Then you can parse the resulting XML/HTML with an XML/HTML document parser and walk the parsed document, performing your search and replace on tag names, attribute names, and values.
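To illustrate the point that deciding what is a tag takes some parsing, here is a rough sketch of a hand-written scanner. It is not the placeholder-and-parse approach described above and it is far from a real HTML tokenizer. It assumes that a tag starts with "<" followed (after any control characters) by a letter, "/", "!" or "?", and that quoted attribute values may contain ">". The name stripInsideTagsScan is hypothetical.

```javascript
// Sketch only: stripInsideTagsScan is a hypothetical helper; the tag-start
// heuristic below is an assumption, not a full HTML tokenizer.
function stripInsideTagsScan(s, ctrl) {
  const chars = Array.from(s); // iterate by code point, not UTF-16 unit
  let out = "";
  let inTag = false;
  let quote = null; // the quote character currently open inside a tag, if any
  for (let i = 0; i < chars.length; i++) {
    const ch = chars[i];
    if (!inTag) {
      if (ch === "<") {
        // Look past control characters to see what really follows "<".
        let j = i + 1;
        while (j < chars.length && chars[j] === ctrl) j++;
        // Assumption: a tag starts with a letter, "/", "!" or "?".
        inTag = j < chars.length && /[A-Za-z\/!?]/.test(chars[j]);
      }
      out += ch;
    } else if (ch === ctrl) {
      // Inside a tag: drop the control character.
    } else {
      if (quote) {
        if (ch === quote) quote = null; // closing quote
      } else if (ch === '"' || ch === "'") {
        quote = ch; // opening quote
      } else if (ch === ">") {
        inTag = false; // an unquoted ">" ends the tag
      }
      out += ch;
    }
  }
  return out;
}

console.log(stripInsideTagsScan("<a title='x>1y'>1</a> I <3 1you", "1"));
```

Unlike the regex versions, this keeps track of quotes, so a ">" inside an attribute value does not end the tag early, and a stray "<" in text is not mistaken for a tag.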

3 answers

solution1
2 ACCEPTED 2010-05-12 08:08:22

solution2
1 2010-05-12 08:28:19

solution3
0 2010-05-12 08:32:25
