
Python BS4 getting all unembedded `li`

I'm trying to grab only the top-level li items in each column, but I either get everything on the page within the div or only the first item in each column.

I've tried setting recursive=False, but that only led to getting the first item in each column.

https://exrx.net/Lists/Directory is the site I'm using for this.

My goal is to grab the href of Shoulders, Neck, Chest, etc., without any of the children under them. I want those specifically because everything under them is repetitive and leads to the same general page. This is the code I currently have:

cut_url = "https://exrx.net/Lists/"

exrx = []

# `soup` is the already-parsed Directory page
for div in soup.find_all('div', class_='col-sm-6'):
    for ul in div.find_all('ul'):
        a = ul.find('a')
        url = cut_url + a['href']
        exrx.append(url)

Try:

import requests
from bs4 import BeautifulSoup

url = "https://exrx.net/Lists/Directory"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for a in soup.select("article li:has(>ul) > a"):
    print(
        a.text,
        "https://exrx.net/Lists/" + a["href"]
        if "https" not in a["href"]
        else a["href"],
    )

Prints:

Neck https://exrx.net/Lists/ExList/NeckWt
Shoulders https://exrx.net/Lists/ExList/ShouldWt
Upper Arms https://exrx.net/Lists/ExList/ArmWt
Forearms https://exrx.net/Lists/ExList/ForeArmWt
Back https://exrx.net/Lists/ExList/BackWt
Chest https://exrx.net/Lists/ExList/ChestWt
Waist https://exrx.net/Lists/ExList/WaistWt
Hips https://exrx.net/Lists/ExList/HipsWt
Thighs https://exrx.net/Lists/ExList/ThighWt
Calves https://exrx.net/Lists/ExList/CalfWt
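
The selector can also be checked offline. Below is a minimal sketch, assuming a small made-up HTML fragment (not the real page markup) and a hypothetical helper named `top_level_links`. It also swaps the manual `"https" not in a["href"]` check for `urllib.parse.urljoin`, which is a substitution on my part and should give the same result for relative and absolute links.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page may differ.
HTML = """
<article>
  <ul>
    <li><a href="ExList/NeckWt">Neck</a>
      <ul>
        <li><a href="ExList/NeckWt#child">Nested item</a></li>
      </ul>
    </li>
    <li><a href="https://example.com/abs">Waist</a>
      <ul>
        <li><a href="ExList/WaistWt#child">Nested item</a></li>
      </ul>
    </li>
  </ul>
</article>
"""

BASE = "https://exrx.net/Lists/"


def top_level_links(html, base=BASE):
    """Return (text, absolute URL) pairs for list items that own a nested <ul>."""
    soup = BeautifulSoup(html, "html.parser")
    # li:has(> ul) keeps only <li> elements with a <ul> as a direct child;
    # "> a" then picks the anchor that is a direct child of that <li>,
    # so anchors inside the nested list are skipped.
    return [
        (a.get_text(strip=True), urljoin(base, a["href"]))
        for a in soup.select("article li:has(> ul) > a")
    ]


links = top_level_links(HTML)
for text, href in links:
    print(text, href)
```

`urljoin` leaves an absolute href untouched and resolves a relative one against the base, so the trailing slash on `BASE` matters.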
