
How to get deeply nested child elements with BeautifulSoup?

I have the following HTML structure:

...

<div cust-attrib-id="root">
  <div cust-attrib-id="root-title"></div>
  <div cust-attrib-id="country">
        <div cust-attrib-id="country-title"></div>
        <div cust-attrib-id="region">
            <div cust-attrib-id="region-title">
                <a href="xx">Frankfurt</a>
            </div>
            <div cust-attrib-id="region-title">
                <a href="xx">Braunschweig</a>
            </div>
            <div cust-attrib-id="region-title">
                <a href="xx">Hamm</a>
            </div>
            ...
        </div>    
    </div>
</div>
...

What is the easiest way to get the <a> tags with the list of regions in Python when using BeautifulSoup? Each <a> tag belongs to a div with the custom attribute cust-attrib-id and the value region-title.

I am at the div with the custom attribute value root, and I would like to iterate over all the nested <a> tags within the divs that have the custom attribute cust-attrib-id with the value region-title.

I am selecting the root element via

soup = BeautifulSoup(source, "html.parser")

rootCategories = soup.select('div[cust-attrib-id="root"]')

Now I could find country, then find all the region elements and iterate over the result with for ... in. But I am looking for a "shortcut" to query these items directly.

So the desired result would be an output like

Frankfurt Braunschweig Hamm

with a query along the lines of

cities = soup.select('div[cust-attrib-id="root"]\\div[cust-attrib-id="country"]\\div[cust-attrib-id="region-title"]')

I think having it cascaded in the query makes it safer, because the attribute value region-title is not unique on the page.

Note: Good answers require good questions, so please help make your problem comprehensible to everyone by improving your question. In general, the existing code and the expected result should be presented as text. Please always provide a minimal, complete, and verifiable example (MCVE) in your questions.

You can use CSS selectors to select all the <a> tags in your HTML.

EDIT (based on your changes)

I think having it cascaded in the query makes it safer, because the attribute value region-title is not unique on the page.

Making your selection as specific as possible is a very good idea. Just chain the attribute and tag selectors to get all the <a> tags you need:

soup.select('div[cust-attrib-id="root"] [cust-attrib-id="region-title"] a')

To get a list of all the city names, you can use your selection in a list comprehension:

cities = [t.text for t in soup.select('div[cust-attrib-id="root"] [cust-attrib-id="region-title"] a')]

Example

from bs4 import BeautifulSoup
    
html = '''<div cust-attrib-id="root">
  <div cust-attrib-id="root-title"></div>
  <div cust-attrib-id="country">
        <div cust-attrib-id="country-title"></div>
        <div cust-attrib-id="region">
            <div cust-attrib-id="region-title">
                <a href="xx">Frankfurt</a>
            </div>
            <div cust-attrib-id="region-title">
                <a href="xx">Braunschweig</a>
            </div>
            <div cust-attrib-id="region-title">
                <a href="xx">Hamm</a>
            </div>
            ...
        </div>    
    </div>
</div>'''

soup = BeautifulSoup(html, "lxml")

cities = [t.text for t in soup.select('div[cust-attrib-id="root"] [cust-attrib-id="region-title"] a')]
print(cities)

Output

['Frankfurt', 'Braunschweig', 'Hamm']
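
If the query should follow the exact nesting, as the question's cascaded attempt suggests, the CSS child combinator > can be used instead of the descendant (space) combinator. The following is only a sketch: the "other" block with the "Decoy" link is a hypothetical addition to show that a region-title outside of root is not matched, and "html.parser" is used so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Shortened version of the question's markup, plus a hypothetical
# "other" block that reuses the region-title value outside of root.
html = """
<div cust-attrib-id="root">
  <div cust-attrib-id="country">
    <div cust-attrib-id="region">
      <div cust-attrib-id="region-title"><a href="xx">Frankfurt</a></div>
      <div cust-attrib-id="region-title"><a href="xx">Hamm</a></div>
    </div>
  </div>
</div>
<div cust-attrib-id="other">
  <div cust-attrib-id="region-title"><a href="xx">Decoy</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# ">" matches direct children only, so every level must be present
# in exactly this order for an <a> tag to be selected.
strict_selector = (
    'div[cust-attrib-id="root"] > '
    'div[cust-attrib-id="country"] > '
    'div[cust-attrib-id="region"] > '
    'div[cust-attrib-id="region-title"] > a'
)

strict_cities = [a.get_text(strip=True) for a in soup.select(strict_selector)]
print(strict_cities)
```

The stricter selector breaks as soon as the page inserts an extra wrapper element between two levels, so the looser descendant version is usually the more robust choice unless the exact hierarchy matters.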

In your example, "region-title" is what you want, so just get every element with "region-title":

for x in soup.find_all(attrs={"cust-attrib-id": 'region-title'}):
    # strip=True removes the whitespace and newlines around the link text
    print(x.get_text(strip=True))

Output:

Frankfurt
Braunschweig
Hamm
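
Since region-title is not unique on the page, the same find_all() approach can be limited to the root element the question already has. This is a minimal sketch under the assumption that a single root div exists; the "other" block with the "Decoy" link is hypothetical and only shows that matches outside of root are skipped.

```python
from bs4 import BeautifulSoup

# Shortened version of the question's markup, plus a hypothetical
# "other" block that reuses the region-title value outside of root.
html = """
<div cust-attrib-id="root">
  <div cust-attrib-id="country">
    <div cust-attrib-id="region">
      <div cust-attrib-id="region-title"><a href="xx">Frankfurt</a></div>
      <div cust-attrib-id="region-title"><a href="xx">Braunschweig</a></div>
    </div>
  </div>
</div>
<div cust-attrib-id="other">
  <div cust-attrib-id="region-title"><a href="xx">Decoy</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Start from the root element so that only its descendants are searched.
root = soup.find("div", attrs={"cust-attrib-id": "root"})

region_names = []
for title_div in root.find_all("div", attrs={"cust-attrib-id": "region-title"}):
    link = title_div.find("a")
    if link is not None:  # skip title divs that contain no link
        region_names.append(link.get_text(strip=True))

print(region_names)
```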


 