Python BS4 getting all unembedded `li`
I'm trying to grab only the top-level `li` items in each column, but I get either everything on the page within the `div` or only the first item in each column.
I've tried setting `recursive=False`, but that only led to getting the first item in each column.
https://exrx.net/Lists/Directory - link to the site I'm using for this.
My goal is to grab the `href` of Shoulders, Neck, Chest, etc., without any of the children under them. I'm looking for those specifically because everything under them is repetitive and leads to the same general page. This is the code I currently have:

    cut_url = "https://exrx.net/Lists/"
    exrx = []
    for a in soup.findAll('div', class_='col-sm-6'):
        for li in a.find_all('ul'):
            a = li.find('a')
            url = cut_url + a['href']
            exrx.append(url)

Try:
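
    import requests
    from bs4 import BeautifulSoup

    url = "https://exrx.net/Lists/Directory"
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    # li:has(>ul) matches only <li> items that directly contain a nested <ul>,
    # and "> a" takes the link that is a direct child of that <li>.
    for a in soup.select("article li:has(>ul) > a"):
        print(
            a.text,
            # Prefix relative hrefs with the list base URL; leave absolute ones alone.
            "https://exrx.net/Lists/" + a["href"]
            if "https" not in a["href"]
            else a["href"],
        )
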
Prints:
Neck https://exrx.net/Lists/ExList/NeckWt
Shoulders https://exrx.net/Lists/ExList/ShouldWt
Upper Arms https://exrx.net/Lists/ExList/ArmWt
Forearms https://exrx.net/Lists/ExList/ForeArmWt
Back https://exrx.net/Lists/ExList/BackWt
Chest https://exrx.net/Lists/ExList/ChestWt
Waist https://exrx.net/Lists/ExList/WaistWt
Hips https://exrx.net/Lists/ExList/HipsWt
Thighs https://exrx.net/Lists/ExList/ThighWt
Calves https://exrx.net/Lists/ExList/CalfWt
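
To see why that selector skips the nested items, here is a minimal offline sketch. The markup in `SAMPLE_HTML` is hypothetical (it only imitates a nested list and is not copied from the site), and `top_level_links` and `BASE_URL` are names made up for this illustration. It also swaps the string check on `"https"` for `urllib.parse.urljoin`, which is one possible way to resolve relative links, not something the code above requires.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup that imitates a nested directory list (an assumption,
# not the real page source).
SAMPLE_HTML = """
<article>
  <div class="col-sm-6">
    <ul>
      <li><a href="ExList/NeckWt">Neck</a>
        <ul>
          <li><a href="ExList/NeckWt#sub-a">Sub-item A</a></li>
        </ul>
      </li>
      <li><a href="ExList/ShouldWt">Shoulders</a>
        <ul>
          <li><a href="ExList/ShouldWt#sub-b">Sub-item B</a></li>
        </ul>
      </li>
    </ul>
  </div>
</article>
"""

# Assumed base for resolving relative hrefs (the page the question links to).
BASE_URL = "https://exrx.net/Lists/Directory"


def top_level_links(html, base_url=BASE_URL):
    """Return (text, absolute URL) pairs for <li> items that hold a nested <ul>."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # li:has(> ul) keeps only <li> elements with a <ul> as a direct child, so the
    # leaf items are skipped; "> a" then picks the link directly inside that <li>.
    for a in soup.select("article li:has(> ul) > a"):
        links.append((a.get_text(strip=True), urljoin(base_url, a["href"])))
    return links


if __name__ == "__main__":
    for text, href in top_level_links(SAMPLE_HTML):
        print(text, href)
```

The leaf items have no `ul` child of their own, so `:has(> ul)` filters them out and only the Neck and Shoulders links remain. `urljoin` resolves a relative `href` against the base URL and returns an already absolute `href` unchanged.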