
Python BS4: getting only the top-level (non-nested) `li` items

I'm trying to grab only the top-level li items in each column, but I get either every li on the page within the div or only the first one in each column.

I've tried setting recursive=False, but that only led to getting the first one in each column.

https://exrx.net/Lists/Directory - link to the site I'm using for this.

My goal is to grab the href of Shoulders, Neck, Chest, etc., without any of the children under them. I want those specifically because everything under them is repetitive and leads to the same general page. This is the code I currently have:

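import requests
from bs4 import BeautifulSoup

cut_url = "https://exrx.net/Lists/"
soup = BeautifulSoup(requests.get(cut_url + "Directory").content, "html.parser")

exrx = []

# Visits every <ul> inside each column, nested ones included.
for column in soup.find_all('div', class_='col-sm-6'):
    for ul in column.find_all('ul'):
        link = ul.find('a')
        url = cut_url + link['href']
        exrx.append(url)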

Try:

import requests
from bs4 import BeautifulSoup

url = "https://exrx.net/Lists/Directory"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for a in soup.select("article li:has(>ul) > a"):
    print(
        a.text,
        "https://exrx.net/Lists/" + a["href"]
        if "https" not in a["href"]
        else a["href"],
    )

Prints:

Neck https://exrx.net/Lists/ExList/NeckWt
Shoulders https://exrx.net/Lists/ExList/ShouldWt
Upper Arms https://exrx.net/Lists/ExList/ArmWt
Forearms https://exrx.net/Lists/ExList/ForeArmWt
Back https://exrx.net/Lists/ExList/BackWt
Chest https://exrx.net/Lists/ExList/ChestWt
Waist https://exrx.net/Lists/ExList/WaistWt
Hips https://exrx.net/Lists/ExList/HipsWt
Thighs https://exrx.net/Lists/ExList/ThighWt
Calves https://exrx.net/Lists/ExList/CalfWt
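To see why the selector picks up only the top-level entries, here is a minimal offline sketch. The HTML is a made-up fragment that only imitates the nesting described in the question (the real page's markup may differ), and `SAMPLE_HTML`, `BASE_URL`, and `top_level_links` are illustrative names, not part of the original answer. It also swaps the manual `"https" not in ...` check for `urllib.parse.urljoin`, which leaves absolute hrefs untouched and resolves relative ones against the base.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup imitating the nesting described in the question;
# the real page's HTML may differ.
SAMPLE_HTML = """
<article>
  <div class="col-sm-6">
    <ul>
      <li><a href="ExList/NeckWt">Neck</a>
        <ul>
          <li><a href="ExList/NeckWt#Sterno">Sternocleidomastoid</a></li>
          <li><a href="ExList/NeckWt#Splenius">Splenius</a></li>
        </ul>
      </li>
      <li><a href="ExList/ShouldWt">Shoulders</a>
        <ul>
          <li><a href="ExList/ShouldWt#Anterior">Deltoid, Anterior</a></li>
        </ul>
      </li>
    </ul>
  </div>
</article>
"""

BASE_URL = "https://exrx.net/Lists/"


def top_level_links(html, base_url=BASE_URL):
    """Return (text, absolute URL) pairs for <li> items that own a nested <ul>."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # li:has(> ul) keeps only <li> elements that have a <ul> as a direct child;
    # the trailing "> a" then takes the link sitting directly inside that <li>.
    for a in soup.select("article li:has(> ul) > a"):
        links.append((a.get_text(strip=True), urljoin(base_url, a["href"])))
    return links


result = top_level_links(SAMPLE_HTML)
for text, link in result:
    print(text, link)
```

The nested entries are skipped because their own li has no ul child, so `li:has(> ul)` never matches them. Note that a top-level li with no sub-list at all would also be skipped by this selector.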
