How to only select certain p tags without children?

I am new to BeautifulSoup and have a basic question about scraping course information from a university website. The HTML is shown below. I'd like to get the text inside the <p> tags, but only from <p> tags that have no child tags such as <strong> or <em>.

Desired text: This course introduces....

Really appreciate your help!

<p>
<strong>MSDS 402 Introduction to Data Science</strong>
</p>
<p>This course introduces.....</p>
<p>
<em>Prerequisites: None.</em>
</p>
<p><a aria-label="MSDS 402-DL Section, ID#: 4765" class="link-list" href=" ">View MSDS 402-DL Sections</a></p>

You can use the CSS selector p:not(:has(*)), which selects only the <p> tags that contain no child tags.

For example:

from bs4 import BeautifulSoup


txt = '''<p>
<strong>MSDS 402 Introduction to Data Science</strong>
</p>
<p>This course introduces.....</p>
<p>
<em>Prerequisites: None.</em>
</p>
<p><a aria-label="MSDS 402-DL Section, ID#: 4765" class="link-list" href=" ">View MSDS 402-DL Sections</a></p>'''


soup = BeautifulSoup(txt, 'html.parser')

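# :has(*) matches a <p> that contains any element; :not(...) inverts that,
# leaving only the <p> tags with no child tags.
for p in soup.select('p:not(:has(*))'):
    print(p)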

Prints:

<p>This course introduces.....</p>
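
Since the question asks for the text rather than the tag itself, here is a small sketch of two ways to get it. The first reuses the selector above and calls get_text() on each match. The second avoids CSS selectors by passing a filter function to find_all(). The names html, is_childless_p, texts_css and texts_filter are made up for this illustration, and the HTML is a shortened copy of the snippet from the question.

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question (illustrative only).
html = '''<p>
<strong>MSDS 402 Introduction to Data Science</strong>
</p>
<p>This course introduces.....</p>
<p>
<em>Prerequisites: None.</em>
</p>'''

soup = BeautifulSoup(html, 'html.parser')

# Option 1: same CSS selector, but collect the text of each matching <p>.
texts_css = [p.get_text(strip=True) for p in soup.select('p:not(:has(*))')]


# Option 2: no CSS selector. Keep <p> tags that contain no descendant tags.
# tag.find(True) returns the first descendant tag, or None if there is none.
def is_childless_p(tag):
    return tag.name == 'p' and tag.find(True) is None


texts_filter = [p.get_text(strip=True) for p in soup.find_all(is_childless_p)]

print(texts_css)
print(texts_filter)
```

Both lists should contain only the course description. The filter-function version may be handy if you want to add further conditions in plain Python, for example skipping empty paragraphs.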
