
Web scraping without an id

I need help with web scraping. Here is an example of the HTML:

<div class="content" name="content-name">
   <h2 class="Topic">First Topic</h2>
   <ul>
      <li>This Data 1</li>
      <li>This Data 2</li>
      <li>This Data 3</li>
   </ul>
   <h2 class="Topic">Second Topic</h2>
   <ul>
      <li>That Data 1</li>
      <li>That Data 2</li>
      <li>That Data 3</li>
   </ul>
   <h2 class="Topic">Third Topic</h2>
   <ul>
      <li>Their Data 1</li>
      <li>Their Data 2</li>
      <li>Their Data 3</li>
   </ul>
</div>

Using BeautifulSoup, I can get the div tag with name="content-name". But how do I get the text of every li inside the ul that follows the h2 whose text is "Second Topic"? Everything sits in the same div, and the ul tags have no class, id, or name of their own. Thanks in advance.

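Extraction is always harder when tags have no id or class and no distinguishing parent tag.

You can use find_previous_sibling: loop over each ul and keep the one whose nearest preceding h2 sibling has the text you want.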

from bs4 import BeautifulSoup
html = """
<div class="content" name="content-name">
   <h2 class="Topic">First Topic</h2>
   <ul>
      <li>This Data 1</li>
      <li>This Data 2</li>
      <li>This Data 3</li>
   </ul>
   <h2 class="Topic">Second Topic</h2>
   <ul>
      <li>That Data 1</li>
      <li>That Data 2</li>
      <li>That Data 3</li>
   </ul>
   <h2 class="Topic">Third Topic</h2>
   <ul>
      <li>Their Data 1</li>
      <li>Their Data 2</li>
      <li>Their Data 3</li>
   </ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

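# Check each ul and keep the one whose nearest preceding h2 sibling matches.
for ul in soup.find_all('ul'):
    heading = ul.find_previous_sibling('h2')
    # Guard against a ul that has no h2 before it.
    if heading is not None and heading.text == 'Second Topic':
        for li in ul.find_all('li'):
            print(li.text)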

This prints:

That Data 1
That Data 2
That Data 3
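
If more than one topic is needed, the same find_previous_sibling idea can collect every list at once. The following is only a sketch under a few assumptions: the function name `group_items_by_heading` is made up for illustration, the sample HTML is a shortened copy of the one in the question, and every ul is assumed to be a sibling of its h2.

```python
from bs4 import BeautifulSoup

html = """
<div class="content" name="content-name">
   <h2 class="Topic">First Topic</h2>
   <ul>
      <li>This Data 1</li>
      <li>This Data 2</li>
   </ul>
   <h2 class="Topic">Second Topic</h2>
   <ul>
      <li>That Data 1</li>
      <li>That Data 2</li>
   </ul>
</div>
"""


def group_items_by_heading(markup):
    """Map each h2's text to the li texts of the ul(s) that follow it.

    Hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(markup, "html.parser")
    grouped = {}
    for ul in soup.find_all("ul"):
        # Nearest h2 that precedes this ul at the same nesting level.
        heading = ul.find_previous_sibling("h2")
        if heading is None:
            continue  # a ul with no heading before it is skipped
        key = heading.get_text(strip=True)
        items = [li.get_text(strip=True) for li in ul.find_all("li")]
        grouped.setdefault(key, []).extend(items)
    return grouped


topics = group_items_by_heading(html)
print(topics["Second Topic"])  # ['That Data 1', 'That Data 2']
```

Building the dictionary once avoids rescanning the document for each topic you want to look up.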

Another option is to find the h2 by its text and then move to the ul that follows it:

from bs4 import BeautifulSoup

src = """
<div class="content" name="content-name">
    <h2 class="Topic">First Topic</h2>
    <ul>
        <li>This Data 1</li>
        <li>This Data 2</li>
        <li>This Data 3</li>
    </ul>
    <h2 class="Topic">Second Topic</h2>
    <ul>
        <li>That Data 1</li>
        <li>That Data 2</li>
        <li>That Data 3</li>
    </ul>
    <h2 class="Topic">Third Topic</h2>
    <ul>
        <li>Their Data 1</li>
        <li>Their Data 2</li>
        <li>Their Data 3</li>
    </ul>
</div>
"""

soup = BeautifulSoup(src, 'lxml')

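# Take the first div with class "content".
content = soup.find("div", class_="content")

# Find the h2 whose text is exactly "Second Topic".
second_topic = content.find("h2", class_="Topic", string="Second Topic")

# find_next_sibling skips the whitespace text node between </h2> and <ul>,
# so the lookup does not depend on how the HTML is indented.
ul = second_topic.find_next_sibling("ul")

for li in ul.find_all("li"):
    print(li.string)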
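
The question writes the heading as "second topic" in lower case, while the page has "Second Topic", and an exact string match would miss that. A possible way to wrap the lookup in a reusable function with case-insensitive matching is sketched below. The name `items_after_heading` is hypothetical, the built-in html.parser is used so that lxml is not required, and the sample HTML is trimmed from the question.

```python
from bs4 import BeautifulSoup

src = """
<div class="content" name="content-name">
    <h2 class="Topic">First Topic</h2>
    <ul>
        <li>This Data 1</li>
    </ul>
    <h2 class="Topic">Second Topic</h2>
    <ul>
        <li>That Data 1</li>
        <li>That Data 2</li>
        <li>That Data 3</li>
    </ul>
</div>
"""


def items_after_heading(markup, heading_text):
    """Return the li texts of the ul directly under the matching h2.

    Hypothetical helper: matching ignores case and surrounding whitespace.
    Returns an empty list when the heading or its ul is not found.
    """
    soup = BeautifulSoup(markup, "html.parser")
    wanted = heading_text.strip().lower()
    for h2 in soup.find_all("h2"):
        if h2.get_text(strip=True).lower() != wanted:
            continue
        ul = h2.find_next_sibling("ul")
        # Make sure the ul belongs to this heading and not to a later one.
        if ul is None or ul.find_previous_sibling("h2") is not h2:
            return []
        return [li.get_text(strip=True) for li in ul.find_all("li")]
    return []


for text in items_after_heading(src, "second topic"):
    print(text)
```

Returning an empty list instead of raising keeps the caller simple when a topic is missing from the page. Whether that is preferable to an exception depends on how the scraper should react to layout changes.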


 