
Python Beautiful Soup 4 Get Children of Element with .select()

The .select() method lets me get elements from a web page based on a CSS selector, but it searches the whole page. How can I use .select() but search only the children of a specific element? For example:

<!-- Simplified example of the structure -->
<ul>
    <li>
        <div class="foo">foo content</div>
        <div class="bar">bar content</div>
        <div class="baz">baz content</div>
    </li>
    <li>
        <!-- We can't assume that foo, bar, and baz will always be there -->
        <div class="foo">foo content</div>
        <div class="baz">baz content</div>
    </li>
    <li>
        <div class="foo">foo content</div>
        <div class="bar">bar content</div>
        <div class="baz">baz content</div>
    </li>
</ul>

I want a way to say: for <li>[0], foo contains the value "foo content", bar contains the value "bar content", and so on.

Currently my solution is the following:

foos = soup.select("div.foo")
bars = soup.select("div.bar")
bazs = soup.select("div.baz")

for i in range(len(foos)):
    # Positional arguments must come before keyword arguments in format()
    print("{i} contains: {} and {} and {}".format(foos[i], bars[i], bazs[i], i=i))

This works for the most part, but it completely falls apart when an element is missing from one of the li's. As shown in the HTML, we cannot assume that foo, bar, and baz will all be present.

So how would I search only the children of each li? Then I could do something like this:
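To make the failure concrete, here is a small sketch of what happens with the page-wide lists. The numbered text (foo1, bar3, and so on) is added here only so the elements can be told apart; it is not part of the original page.

```python
from bs4 import BeautifulSoup

# Same structure as above; the numbering is an illustrative assumption.
html = """<ul>
<li><div class="foo">foo1</div><div class="bar">bar1</div><div class="baz">baz1</div></li>
<li><div class="foo">foo2</div><div class="baz">baz2</div></li>
<li><div class="foo">foo3</div><div class="bar">bar3</div><div class="baz">baz3</div></li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")
foos = soup.select("div.foo")
bars = soup.select("div.bar")
bazs = soup.select("div.baz")

# The lists no longer have the same length, so indexing by position breaks.
print(len(foos), len(bars), len(bazs))

# Index 1 pairs foo2 (second li) with bar3 (third li): the lists are misaligned.
paired = (foos[1].get_text(), bars[1].get_text())
print(paired)
```

Indexing bars at position 2 would additionally raise an IndexError, because only two bar elements exist on the page.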

for i in soup.select("li"):
    #how would i do this:
    foo = child_of("li", "div.foo")????
    bar = child_of("li", "div.bar")????
    baz = child_of("li", "div.baz")????

You can use element:nth-of-type(n) like so:

from bs4 import BeautifulSoup

a = """<!-- Simplified example of the structure -->
<ul>
    <li>
        <div class="foo">foo1 content</div>
        <div class="bar">bar1 content</div>
        <div class="baz">baz1 content</div>
    </li>
    <li>
        <!-- We can't assume that foo, bar, and baz will always be there -->
        <div class="foo">foo2 content</div>
        <div class="baz">baz2 content</div>
    </li>
    <li>
        <div class="foo">foo3 content</div>
        <div class="bar">bar3 content</div>
        <div class="baz">baz3 content</div>
    </li>
</ul>
"""

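# Name the parser explicitly so the result does not depend on what is installed
s = BeautifulSoup(a, 'html.parser')
# Pick the second <li>, then search only inside it
s2 = s.select('ul > li:nth-of-type(2)')[0]
foo, bar, baz = s2.select('div.foo'), s2.select('div.bar'), s2.select('div.baz')
print(foo, bar, baz)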

Output:

[<div class="foo">foo2 content</div>] [] [<div class="baz">baz2 content</div>]

for li in soup.select('li'):
    foo = li.select('.foo')
    bar = li.select('.bar')
    baz = li.select('.baz')

Each time you iterate over an li tag and call select() on it, the HTML being searched is only that li tag's content, for example:

<li>
    <div class="foo">foo content</div>
    <div class="bar">bar content</div>
    <div class="baz">baz content</div>
</li>

So you can call select() on the li itself to get that li's children, because the search is limited to what the li contains.
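A minimal, self-contained sketch of this approach, handling the missing elements the question mentions. The helper name text_or_none and the numbered sample text are assumptions made for illustration; select_one() returns None when nothing matches, which is what makes the missing case easy to detect.

```python
from bs4 import BeautifulSoup

# Sample markup; the numbering is added only to tell the elements apart.
html = """
<ul>
    <li>
        <div class="foo">foo1 content</div>
        <div class="bar">bar1 content</div>
        <div class="baz">baz1 content</div>
    </li>
    <li>
        <div class="foo">foo2 content</div>
        <div class="baz">baz2 content</div>
    </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")


def text_or_none(tag, selector):
    """Return the text of the first match inside `tag`, or None if absent."""
    match = tag.select_one(selector)
    return match.get_text(strip=True) if match is not None else None


rows = []
for li in soup.select("ul > li"):
    # Each select_one() call is scoped to this <li> only.
    rows.append({name: text_or_none(li, "div." + name)
                 for name in ("foo", "bar", "baz")})

for i, row in enumerate(rows):
    print(i, row)
```

Each row keeps the three values for one li together, so a missing bar shows up as None in that row instead of shifting the values of later rows.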

This worked for me; all the foos, bars, and bazs are stored in separate lists:

foos = []
bars = []
bazs = []
for i in soup.find_all('li'):
    # Re-parse this <li> on its own so the searches below see only its contents
    soup2 = BeautifulSoup(str(i), 'html.parser')
    print(soup2)
    foos.extend(soup2.find_all('div', {'class': 'foo'}))
    bars.extend(soup2.find_all('div', {'class': 'bar'}))
    bazs.extend(soup2.find_all('div', {'class': 'baz'}))
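Re-parsing each li is probably not necessary, since find_all() can be called on a Tag as well as on the whole soup. The following is a sketch of that variant, not the answer author's code; it uses recursive=False to restrict the search to direct children, and the sample markup is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup, assumed for illustration.
html = """
<ul>
    <li>
        <div class="foo">foo1 content</div>
        <div class="bar">bar1 content</div>
        <div class="baz">baz1 content</div>
    </li>
    <li>
        <div class="foo">foo2 content</div>
        <div class="baz">baz2 content</div>
    </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

foos, bars, bazs = [], [], []
for li in soup.find_all("li"):
    # Searching from the Tag limits results to this <li>;
    # recursive=False further limits them to its direct children.
    foos.extend(li.find_all("div", class_="foo", recursive=False))
    bars.extend(li.find_all("div", class_="bar", recursive=False))
    bazs.extend(li.find_all("div", class_="baz", recursive=False))

print(len(foos), len(bars), len(bazs))
```

Note that separate lists still lose track of which li each element came from, so grouping the results per li may be the better fit for the original question.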

