How to keep characters with regular expressions that I don't want to delete in python?

Question

I use this code to delete all tag elements in HTML.

import re
MyString = 'aaa<p>Radio and television.<br></p><p>very<br/> popular in the world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'
MyString = re.sub('<[^>]*>', '', MyString)
print(MyString)

The output is:

aaaRadio and television.very popular in the world today.Millions of people watch TV. That’s because a radio is very small 98.2%and it‘s easy to carry. haha100%bb

But now I need to keep <br> and <br/>.

I want the output to look like this:

aaaRadio and television.<br>very<br/> popular in the world today.Millions of people watch TV. That’s because a radio is very small 98.2%and it‘s easy to carry. haha100%bb

How should I modify my code?

Answer 1

You can capture <br> tags in group 1, match any other tag with a second alternative, and replace the whole match with \1. This keeps the <br> tags and removes all other tags. Replace

(?i)(<br\/?>)|<[^>]*>

with \1. I also added the (?i) inline modifier to make the regex case-insensitive, so that it also matches <BR> or <BR/>. Alternatively, you can pass flags=re.IGNORECASE to re.sub as a keyword argument; note that the fourth positional argument of re.sub is count, not flags.

Your updated Python code:

import re
MyString = 'aaa<p>Radio and television.<br></p><p>very<br/> popular <BR>in the <BR/>world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'
MyString = re.sub('(?i)(<br/?>)|<[^>]*>', r'\1', MyString)
print(MyString)

This prints the string with only the br tags kept and all other tags removed:

aaaRadio and television.<br>very<br/> popular <BR>in the <BR/>world today.Millions of people watch TV. That’s because a radio is very small 98.2%and it‘s easy to carry. haha100%bb
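The same capture-group substitution can also be written with a precompiled pattern and the flags keyword instead of the inline (?i) modifier. This is only a minimal sketch: the input string is a shortened, hypothetical sample rather than the one from the question, and the names html, pattern and cleaned are made up for illustration. It relies on Python 3.5 or later, where an unmatched group in the replacement is substituted with an empty string.

```python
import re

# Hypothetical sample input, shortened from the question's string.
html = 'aaa<p>Radio<br></p><p>very<BR/> popular</p>bb'

# Group 1 captures <br> or <br/>; the second alternative matches any other tag.
# flags must be passed by keyword (or at compile time), never as the 4th positional arg.
pattern = re.compile(r'(<br/?>)|<[^>]*>', flags=re.IGNORECASE)

# \1 re-inserts a captured br tag; for any other tag group 1 did not
# participate in the match, so it is replaced with an empty string.
cleaned = pattern.sub(r'\1', html)
print(cleaned)  # aaaRadio<br>very<BR/> popularbb
```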

Alternatively, you can use a negative lookahead to skip br tags and match only the other tags, using this regex:

(?i)<(?!br/?>)[^>]*>

and just replace each match with an empty string.

Python code using the negative lookahead regex:

import re
MyString = 'aaa<p>Radio and television.<br></p><p>very<br/> popular <BR>in the <BR/>world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'
MyString = re.sub('(?i)<(?!br/?>)[^>]*>', r'', MyString)
print(MyString)

This prints:

aaaRadio and television.<br>very<br/> popular <BR>in the <BR/>world today.Millions of people watch TV. That’s because a radio is very small 98.2%and it‘s easy to carry. haha100%bb
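If more than one tag needs to be kept, the negative lookahead can be built from a list of tag names. The helper below is a hypothetical sketch, not part of the answer above: strip_tags_except and its keep parameter are made-up names, and it assumes the tags to keep are written without attributes or inner whitespace (so <br> and <br/> are kept, but <br /> would still be removed).

```python
import re

def strip_tags_except(text, keep=('br',)):
    """Remove every tag except bare <name> or <name/> for the names in keep."""
    # Escape each name so it is matched literally inside the alternation.
    names = '|'.join(re.escape(name) for name in keep)
    # The lookahead rejects a match when '<' is followed by a kept name,
    # an optional '/', and the closing '>'.
    pattern = r'<(?!(?:%s)/?>)[^>]*>' % names
    return re.sub(pattern, '', text, flags=re.IGNORECASE)

sample = '<p>one<br>two<HR/>three <b>four</b></p>'
print(strip_tags_except(sample))                     # one<br>twothree four
print(strip_tags_except(sample, keep=('br', 'hr')))  # one<br>two<HR/>three four
```

Note that <b> is still removed even though "br" starts with "b", because the lookahead requires the full name followed by an optional slash and ">".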

solution1
1 ACCEPTED 2019-04-30 09:22:20
