I need help with web scraping. Here is an example of the HTML:
<div class="content" name="content-name">
<h2 class="Topic">First Topic</h2>
<ul>
<li>This Data 1</li>
<li>This Data 2</li>
<li>This Data 3</li>
</ul>
<h2 class="Topic">Second Topic</h2>
<ul>
<li>That Data 1</li>
<li>That Data 2</li>
<li>That Data 3</li>
</ul>
<h2 class="Topic">Third Topic</h2>
<ul>
<li>Their Data 1</li>
<li>Their Data 2</li>
<li>Their Data 3</li>
</ul>
</div>
Using BeautifulSoup, I can get the div tag with name="content-name". But how do I get the text of every li inside the ul that follows the h2 whose text is "Second Topic"? All of these elements sit in the same div, and none of them has a distinguishing class, id, or name. Thanks in advance.
Selecting elements is harder when the tags have no distinguishing id or class and no wrapper element of their own.
You can use find_previous_sibling to check which h2 precedes each ul:
from bs4 import BeautifulSoup
html = """
<div class="content" name="content-name">
<h2 class="Topic">First Topic</h2>
<ul>
<li>This Data 1</li>
<li>This Data 2</li>
<li>This Data 3</li>
</ul>
<h2 class="Topic">Second Topic</h2>
<ul>
<li>That Data 1</li>
<li>That Data 2</li>
<li>That Data 3</li>
</ul>
<h2 class="Topic">Third Topic</h2>
<ul>
<li>Their Data 1</li>
<li>Their Data 2</li>
<li>Their Data 3</li>
</ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
# Walk every <ul> and keep the one whose nearest preceding <h2> sibling matches
for ul in soup.find_all('ul'):
    if ul.find_previous_sibling('h2').text == 'Second Topic':
        for li in ul.find_all('li'):
            print(li.text)
Output:
That Data 1
That Data 2
That Data 3
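The same sibling relationship can also be used in the other direction: find the matching h2 first, then step forward to its list with find_next_sibling. The following is only a sketch, assuming each heading is followed by its own ul inside the same parent. The helper name items_under_heading is hypothetical, the HTML is a shortened copy of the example, and the case-insensitive comparison is an addition that the answer above does not make.

```python
from bs4 import BeautifulSoup

# Shortened copy of the example markup from the question
html = """
<div class="content" name="content-name">
<h2 class="Topic">First Topic</h2>
<ul><li>This Data 1</li><li>This Data 2</li></ul>
<h2 class="Topic">Second Topic</h2>
<ul><li>That Data 1</li><li>That Data 2</li><li>That Data 3</li></ul>
</div>
"""

def items_under_heading(soup, heading_text):
    """Return the <li> texts of the first <ul> sibling after the matching <h2>.

    Hypothetical helper: returns an empty list when no heading matches.
    """
    wanted = heading_text.strip().lower()
    for h2 in soup.find_all("h2"):
        if h2.get_text(strip=True).lower() == wanted:
            # find_next_sibling skips whitespace text nodes between the tags
            ul = h2.find_next_sibling("ul")
            if ul is None:
                return []
            return [li.get_text(strip=True) for li in ul.find_all("li")]
    return []

soup = BeautifulSoup(html, "html.parser")
print(items_under_heading(soup, "second topic"))
# prints ['That Data 1', 'That Data 2', 'That Data 3']
```

Note that find_next_sibling("ul") returns the next ul at the same level even if another h2 sits in between, so a heading with no list of its own would pick up the following topic's items.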
from bs4 import BeautifulSoup
src = """
<div class="content" name="content-name">
<h2 class="Topic">First Topic</h2>
<ul>
<li>This Data 1</li>
<li>This Data 2</li>
<li>This Data 3</li>
</ul>
<h2 class="Topic">Second Topic</h2>
<ul>
<li>That Data 1</li>
<li>That Data 2</li>
<li>That Data 3</li>
</ul>
<h2 class="Topic">Third Topic</h2>
<ul>
<li>Their Data 1</li>
<li>Their Data 2</li>
<li>Their Data 3</li>
</ul>
</div>
"""
soup = BeautifulSoup(src, 'lxml')
content = soup.find_all("div", class_="content")[0]
second_topic = content.find_all("h2", class_="Topic", string="Second Topic")[0]
# find_next_sibling skips the whitespace text node between </h2> and <ul>
ul = second_topic.find_next_sibling("ul")
li = ul.find_all("li")
for i in li:
    print(i.string)
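If more than one topic is needed, the heading-to-list relationship can be collected into a dictionary in a single pass. This is a rough sketch rather than part of either answer: topics_to_items is a hypothetical name, the div is looked up by its name attribute as the question describes, and html.parser is used here only to avoid the extra lxml dependency.

```python
from bs4 import BeautifulSoup

src = """
<div class="content" name="content-name">
<h2 class="Topic">First Topic</h2>
<ul><li>This Data 1</li><li>This Data 2</li><li>This Data 3</li></ul>
<h2 class="Topic">Second Topic</h2>
<ul><li>That Data 1</li><li>That Data 2</li><li>That Data 3</li></ul>
<h2 class="Topic">Third Topic</h2>
<ul><li>Their Data 1</li><li>Their Data 2</li><li>Their Data 3</li></ul>
</div>
"""

def topics_to_items(container):
    """Map each <h2 class="Topic"> text to the <li> texts of the <ul> after it."""
    result = {}
    for h2 in container.find_all("h2", class_="Topic"):
        ul = h2.find_next_sibling("ul")
        items = [li.get_text(strip=True) for li in ul.find_all("li")] if ul else []
        result[h2.get_text(strip=True)] = items
    return result

soup = BeautifulSoup(src, "html.parser")
# "name" is also the tag-name argument of find(), so the HTML attribute
# called name has to be passed through attrs instead of as a keyword
content = soup.find("div", attrs={"name": "content-name"})
topics = topics_to_items(content)
print(topics["Second Topic"])
# prints ['That Data 1', 'That Data 2', 'That Data 3']
```

Building the full mapping costs one pass over the headings, after which any topic can be looked up by its exact heading text.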