How can I use a regular expression in Python to keep characters that I don't want to delete?

I use this code to delete all tags from an HTML string.

import re
MyString = 'aaa<p>Radio and television.<br></p><p>very<br/> popular in the world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'
MyString = re.sub('<[^>]*>', '', MyString)
print(MyString)

The output is:

aaaRadio and television.very popular in the world today.Millions of people watch TV. That’s because a radio is very small 98.2%and it‘s easy to carry. haha100%bb

But now I need to keep <br> and <br/>.

I want the output to look like this:

aaaRadio and television.<br>very<br/> popular in the world today.Millions of people watch TV. That’s because a radio is very small 98.2%and it‘s easy to carry. haha100%bb

How should I modify my code?

You can capture <br> tags in group 1, match any other tag with a second alternative, and replace each match with \1. When the match is a <br> tag, \1 puts it back; for any other tag group 1 is empty, so the tag is removed. Replace

(?i)(<br/?>)|<[^>]*>

with \1. I also added the (?i) inline modifier so the regex is case-insensitive and also matches <BR> and <BR/>. Alternatively, pass flags=re.IGNORECASE to re.sub. It must be passed as a keyword argument, because the fourth positional argument of re.sub is count, not flags.

Your updated Python code:

import re
MyString = 'aaa<p>Radio and television.<br></p><p>very<br/> popular <BR>in the <BR/>world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'
MyString = re.sub('(?i)(<br/?>)|<[^>]*>', r'\1', MyString)
print(MyString)

This prints the string with only the br tags kept and all other tags removed:

aaaRadio and television.<br>very<br/> popular <BR>in the <BR/>world today.Millions of people watch TV. That’s because a radio is very small 98.2%and it‘s easy to carry. haha100%bb
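As a minimal sketch of the keyword-flag variant mentioned above, the same substitution can be written with a compiled pattern. The names TAG_OR_BR and strip_tags_keep_br are made up for this illustration and are not part of the original answer. It relies on Python 3.5 or later, where a group that did not take part in the match expands to an empty string in the replacement.

```python
import re

# Hypothetical names for illustration only.
# flags is given as a keyword here; in re.sub the fourth positional
# argument is count, so flags should not be passed positionally there.
TAG_OR_BR = re.compile(r'(<br/?>)|<[^>]*>', flags=re.IGNORECASE)

def strip_tags_keep_br(html):
    # Group 1 is set only when the match is a <br> or <br/> tag.
    # For every other tag it is unset, so \1 expands to '' (Python 3.5+).
    return TAG_OR_BR.sub(r'\1', html)

print(strip_tags_keep_br('<p>one<BR>two<br/>three</p>'))
# one<BR>two<br/>three
```

Compiling the pattern once is only a convenience when the same cleanup runs on many strings; the behavior is the same as the re.sub call above.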

As another approach, you can use a negative lookahead so that br tags are never matched in the first place, using this regex,

(?i)<(?!br/?>)[^>]*>

and replace each match with an empty string.

Python code using the negative lookahead regex:

import re
MyString = 'aaa<p>Radio and television.<br></p><p>very<br/> popular <BR>in the <BR/>world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'
MyString = re.sub(r'(?i)<(?!br/?>)[^>]*>', '', MyString)
print(MyString)

This prints:

aaaRadio and television.<br>very<br/> popular <BR>in the <BR/>world today.Millions of people watch TV. That’s because a radio is very small 98.2%and it‘s easy to carry. haha100%bb
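If more than one tag needs to survive, the lookahead can be built from a list of tag names. The following is only a sketch under that assumption: strip_tags_except and its keep parameter are hypothetical names introduced here, not something from the original answer.

```python
import re

def strip_tags_except(html, keep=('br',)):
    # Hypothetical helper for illustration only.
    # Join the tag names to keep into one alternation, escaping each name.
    names = '|'.join(re.escape(name) for name in keep)
    # The negative lookahead rejects exactly "<name>" and "<name/>";
    # every other tag is matched and replaced with an empty string.
    pattern = re.compile(r'<(?!(?:%s)/?>)[^>]*>' % names, flags=re.IGNORECASE)
    return pattern.sub('', html)

print(strip_tags_except('<p>a<br>b<hr/>c</p>', keep=('br', 'hr')))
# a<br>b<hr/>c
```

Like the regex above, this keeps only the bare forms of a tag. A kept tag written with attributes or with a space before the slash, such as <br class="x"> or <br />, does not satisfy the lookahead and is removed.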
