I want to extract the text of the tags in an HTML file without getting duplicates from nested tags. I would prefer a BeautifulSoup-based solution, but a regex will also work. For example, given:
`<li><p>HELLO1</p></li > <li>HELLO2</li><p>HELLO3</p>`
Expected output:
HELLO1 HELLO2 HELLO3
I tried a regex-style replacement, but could not work out how to apply it to the soup object:
str(soup).replace("< li > < p >","< p >")
I also tried find_all:
tags = soup.find_all(['p','li'])
It returns:
< p >HELLO1< /p >,
HELLO1 ,
HELLO2 ,
HELLO3
If li and p tags are nested, the result should show only one occurrence, or one of the nested tags should be removed. For example, `<li><p>XYZ</p></li>` should become `<li>XYZ</li>`.
You could use the .get_text() method:
from bs4 import BeautifulSoup

data = '''<li><p>HELLO1</p></li > <li>HELLO2</li><p>HELLO3</p>'''
soup = BeautifulSoup(data, 'html.parser')
# Join all text nodes with a single space, stripping surrounding whitespace
print(soup.get_text(separator=' ', strip=True))
Prints:
HELLO1 HELLO2 HELLO3
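If you need the tags themselves rather than only the text (the `<li><p>XYZ</p></li>` to `<li>XYZ</li>` case from the question), one possible approach is to unwrap each p whose direct parent is an li before calling find_all. This is only a sketch, assuming that direct li > p nesting is the only kind that matters; the variable names `tags` and `texts` are illustrative:

```python
from bs4 import BeautifulSoup

data = '''<li><p>HELLO1</p></li > <li>HELLO2</li><p>HELLO3</p>'''
soup = BeautifulSoup(data, 'html.parser')

# Assumption: only a <p> sitting directly inside an <li> counts as "nested".
# unwrap() removes the tag itself but leaves its contents in place.
for p in soup.find_all('p'):
    if p.parent is not None and p.parent.name == 'li':
        p.unwrap()

# Each piece of text now belongs to exactly one matching tag.
tags = soup.find_all(['p', 'li'])
texts = [tag.get_text(strip=True) for tag in tags]

print(tags)
print(' '.join(texts))
```

Because find_all returns a list, unwrapping tags while looping over it is safe. Deeper nesting (for example a p inside a div inside an li) would need a different parent check.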