Python Beautiful Soup 4 Get Children of Element with .select()

The .select() method lets me get elements from a web page based on a CSS selector, but it searches the whole page. How would I use .select() to search only the children of a specific element? E.g.:

<!-- Simplified example of the structure -->
<ul>
    <li>
        <div class="foo">foo content</div>
        <div class="bar">bar content</div>
        <div class="baz">baz content</div>
    </li>
    <li>
        <!-- We can't assume that foo, bar, and baz will always be there -->
        <div class="foo">foo content</div>
        <div class="baz">baz content</div>
    </li>
    <li>
        <div class="foo">foo content</div>
        <div class="bar">bar content</div>
        <div class="baz">baz content</div>
    </li>
</ul>

I want a way to say: for <li> [0], foo contained the value "foo content", bar contained the value "bar content", etc.

Currently my solution is the following:

foos = soup.select("div.foo")
bars = soup.select("div.bar")
bazs = soup.select("div.baz")

for i in range(len(foos)):
    # Positional arguments must come before keyword arguments in format()
    print("{i} contains: {} and {} and {}".format(foos[i], bars[i], bazs[i], i=i))

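This works for the most part, but it completely falls apart when an element is missing from one of the <li>s: the three lists then have different lengths, so index i no longer refers to the same <li> in each list. As shown in the HTML, we cannot assume that all three of the foo, bar, and baz elements will be present.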

So, how would I search only the children of each <li>? Then I could do something like this:

for i in soup.select("li"):
    # How would I do this?
    foo = child_of("li", "div.foo")????
    bar = child_of("li", "div.bar")????
    baz = child_of("li", "div.baz")????

You can use element:nth-of-type(n) like so:

from bs4 import BeautifulSoup

a = """<!-- Simplified example of the structure -->
<ul>
    <li>
        <div class="foo">foo1 content</div>
        <div class="bar">bar1 content</div>
        <div class="baz">baz1 content</div>
    </li>
    <li>
        <!-- We can't assume that foo, bar, and baz will always be there -->
        <div class="foo">foo2 content</div>
        <div class="baz">baz2 content</div>
    </li>
    <li>
        <div class="foo">foo3 content</div>
        <div class="bar">bar3 content</div>
        <div class="baz">baz3 content</div>
    </li>
</ul>
"""

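# Name the parser explicitly so bs4 does not have to guess one
s = BeautifulSoup(a, "html.parser")
# Pick the second <li>, then search only inside it
s2 = s.select('ul > li:nth-of-type(2)')[0]
foo, bar, baz = s2.select('div.foo'), s2.select('div.bar'), s2.select('div.baz')
print(foo, bar, baz)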

Output:

[<div class="foo">foo2 content</div>] [] [<div class="baz">baz2 content</div>]

for li in soup.select('li'):
    foo = li.select('.foo')
    bar = li.select('.bar')
    baz = li.select('.baz')

Each time you iterate over an li tag and call select() on it, the HTML being searched is only that li tag's content, like:

<li>
    <div class="foo">foo content</div>
    <div class="bar">bar content</div>
    <div class="baz">baz content</div>
</li>

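So you can call select() on each li to find its children, because the search covers only the tags inside that li. Note that select() called on a tag matches descendants at any depth, not only direct children.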
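To get the "for li [0], foo contained ..." shape the question asks for, something like the following might work. It is only a sketch: `text_or_none` and `rows` are names made up for this example, and it assumes each li holds at most one div per class, so `select_one()` (which returns the first match or None) is enough.

```python
from bs4 import BeautifulSoup

html = """
<ul>
    <li>
        <div class="foo">foo1 content</div>
        <div class="bar">bar1 content</div>
        <div class="baz">baz1 content</div>
    </li>
    <li>
        <div class="foo">foo2 content</div>
        <div class="baz">baz2 content</div>
    </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")


def text_or_none(parent, selector):
    """Hypothetical helper: text of the first match under parent, or None."""
    match = parent.select_one(selector)  # searches only inside parent
    return match.get_text(strip=True) if match is not None else None


rows = []
for li in soup.select("li"):
    # One dict per <li>; a missing div shows up as None instead of shifting indexes
    rows.append({name: text_or_none(li, "div." + name) for name in ("foo", "bar", "baz")})

for i, row in enumerate(rows):
    print(i, row)
```

Because every li produces exactly one row, a missing bar in the second li stays attached to that li rather than throwing off the later indexes.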

This worked for me; all the foos, bars, and bazs are stored in separate lists:

foos = []
bars = []
bazs = []
for li in soup.find_all('li'):
    # Re-parse this <li> on its own so the searches below see only its contents
    soup2 = BeautifulSoup(str(li), 'html.parser')
    print(soup2)
    foos.extend(soup2.find_all('div', {'class': 'foo'}))
    bars.extend(soup2.find_all('div', {'class': 'bar'}))
    bazs.extend(soup2.find_all('div', {'class': 'baz'}))
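As far as I can tell, the re-parse through BeautifulSoup(str(...)) is not required, since find() and find_all() called on a tag already search only inside that tag. Here is a rough sketch of a variant that also keeps the lists the same length by storing None for a missing div. The sample HTML, including the "wrapper" div, is made up for this example, and `recursive=False` is used on the assumption that only direct children of the li should count.

```python
from bs4 import BeautifulSoup

html = """
<ul>
    <li>
        <div class="foo">foo1 content</div>
        <div class="bar">bar1 content</div>
    </li>
    <li>
        <div class="foo">foo2 content</div>
        <div class="wrapper"><div class="bar">nested bar</div></div>
    </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

foos, bars = [], []
for li in soup.find_all("li"):
    # recursive=False limits the search to direct children of this <li>;
    # find() returns None when nothing matches, which keeps the lists aligned
    foos.append(li.find("div", class_="foo", recursive=False))
    bars.append(li.find("div", class_="bar", recursive=False))

# The same direct-child restriction written as a CSS selector
second_li = soup.find_all("li")[1]
nested_only = second_li.select_one(":scope > div.bar")

print(foos)
print(bars)
print(nested_only)
```

In the second li the bar div sits inside the wrapper, so it is skipped by the direct-child search; dropping `recursive=False` (or the `:scope >` prefix) would pick it up as a descendant instead.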

