I use this code to remove all tags from an HTML string.
import re
MyString = 'aaa<p>Radio and television.<br></p><p>very<br/> popular in the world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'
MyString = re.sub('<[^>]*>', '', MyString)
print(MyString)
The output is:
aaaRadio and television.very popular in the world today.Millions of people watch TV. That’s because a radio is very small 98.2%and it‘s easy to carry. haha100%bb
But now I need to keep <br> and <br/>.
I want the output to look like this:
aaaRadio and television.<br>very<br/> popular in the world today.Millions of people watch TV. That’s because a radio is very small 98.2%and it‘s easy to carry. haha100%bb
How should I modify my code?
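You can capture <br> tags in group 1, match any other tag with a second alternative, and replace each match with \1. This keeps the <br> tags and removes all other tags. Replace
(?i)(<br/?>)|<[^>]*>
with \1. When a non-br tag matches, group 1 does not participate, so the tag is replaced with nothing (Python 3.5 and later substitute an empty string for an unmatched group). The (?i) inline modifier makes the regex case-insensitive so that it also matches <BR> and <BR/>. Alternatively, pass flags=re.IGNORECASE to re.sub; use the keyword form, because the fourth positional argument of re.sub is count, not flags.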
Your updated Python code:
import re
MyString = 'aaa<p>Radio and television.<br></p><p>very<br/> popular <BR>in the <BR/>world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'
MyString = re.sub('(?i)(<br/?>)|<[^>]*>', r'\1', MyString)
print(MyString)
This prints the string with only the br tags kept and all other tags removed:
aaaRadio and television.<br>very<br/> popular <BR>in the <BR/>world today.Millions of people watch TV. That’s because a radio is very small 98.2%and it‘s easy to carry. haha100%bb
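The capture-group approach above can also be written with the flag passed by keyword and a precompiled pattern. This is only a minimal sketch: the names TAG_RE and strip_tags_keep_br are hypothetical, and a replacement function is used instead of \1 so that the code does not depend on how an unmatched group is substituted.

```python
import re

# Group 1 captures <br> or <br/>; the second alternative matches any other tag.
# TAG_RE and strip_tags_keep_br are illustrative names, not from the original post.
TAG_RE = re.compile(r'(<br/?>)|<[^>]*>', flags=re.IGNORECASE)

def strip_tags_keep_br(html):
    # m.group(1) is None when a non-br tag matched, so fall back to ''.
    return TAG_RE.sub(lambda m: m.group(1) or '', html)

print(strip_tags_keep_br('a<p>b<br>c<BR/>d</p>e'))
```

Passing flags by keyword avoids accidentally supplying re.IGNORECASE as the count argument.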
Alternatively, you can use a negative lookahead to skip br tags:
(?i)<(?!br/?>)[^>]*>
and replace each match with an empty string.
Python code using the negative lookahead regex:
import re
MyString = 'aaa<p>Radio and television.<br></p><p>very<br/> popular <BR>in the <BR/>world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'
MyString = re.sub('(?i)<(?!br/?>)[^>]*>', r'', MyString)
print(MyString)
This prints:
aaaRadio and television.<br>very<br/> popular <BR>in the <BR/>world today.Millions of people watch TV. That’s because a radio is very small 98.2%and it‘s easy to carry. haha100%bb
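The lookahead above spares only the exact forms <br> and <br/>. If the input might also contain <br /> or a br tag with attributes (an assumption; the question does not show such input), one possible variant replaces the lookahead with (?!br\b). The sketch below illustrates this; the names LOOSE_TAG_RE and strip_tags_keep_any_br are hypothetical.

```python
import re

# (?!br\b) rejects any tag whose name is exactly "br", whatever follows it,
# while still removing tags such as <break> whose name merely starts with "br".
# LOOSE_TAG_RE and strip_tags_keep_any_br are illustrative names.
LOOSE_TAG_RE = re.compile(r'<(?!br\b)[^>]*>', flags=re.IGNORECASE)

def strip_tags_keep_any_br(html):
    # Every matched (non-br) tag is replaced with an empty string.
    return LOOSE_TAG_RE.sub('', html)

print(strip_tags_keep_any_br('x<p>a<br />b<BR class="c">c<break>d</p>y'))
```

Like the original pattern, this variant still removes a closing </br> tag, because the lookahead is tested immediately after the opening <.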