Beautiful Soup - Selecting Class has Unexpected Results
I am new to programming and have been learning Python through web scraping. What I am trying to do is capture the line below from the site listed in my URL:
<a class="" href="https://www.adweek.com?paged=776%3Fs%3Dinterpublic&orderby=date&s=interpublic">776</a>
but I cannot seem to get there. My code only returns the first line of pagination information, and I can't figure out why. Any help would be greatly appreciated.
import requests
from bs4 import BeautifulSoup
url = 'https://www.adweek.com/?s=interpublic&orderby=date'
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')
k = soup.find_all('div', {'class': 'pagination-centered'})
Returns only:
[<div class="pagination-centered"><ul class="pagination">
<li><span aria-current="page" class="current">1</span></li></ul></div>]
Thanks, Seth
You can get the pagination links using the a[href*="paged="] CSS selector, which matches every <a> tag whose href contains "paged=":
import requests
from bs4 import BeautifulSoup
url = 'https://www.adweek.com/?s=interpublic&orderby=date'
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')
# print text and href
pagination = soup.select('a[href*="paged="]')
for p in pagination:
    print(p.text.strip(), p.get('href'))
The "Next" link has the same URL as the first page link, so you can use a set to keep only the unique href values:
pagination = {p['href'] for p in soup.select('a[href*="paged="]')}
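One caveat with the set: it does not keep the links in page order. If order matters, dict.fromkeys is one way to drop duplicates while preserving it. Below is a minimal sketch that runs against a small hand-written HTML snippet rather than the live site. SAMPLE_HTML and unique_page_links are names made up for this illustration, and the markup is only modelled on the link quoted in the question, so the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modelled on the link quoted in the question.
# The real page may be structured differently.
SAMPLE_HTML = """
<div class="pagination-centered"><ul class="pagination">
<li><span aria-current="page" class="current">1</span></li>
<li><a class="" href="https://www.adweek.com?paged=2&amp;s=interpublic">2</a></li>
<li><a class="" href="https://www.adweek.com?paged=776&amp;s=interpublic">776</a></li>
<li><a class="" href="https://www.adweek.com?paged=2&amp;s=interpublic">Next</a></li>
</ul></div>
"""


def unique_page_links(html):
    """Return the pagination hrefs in document order, without duplicates."""
    soup = BeautifulSoup(html, 'html.parser')
    hrefs = [a['href'] for a in soup.select('a[href*="paged="]')]
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(hrefs))


links = unique_page_links(SAMPLE_HTML)
for link in links:
    print(link)
```

Here "Next" points at the same URL as page 2, so only two links are printed.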
You can then get the last page number from these links and iterate over the results by changing the paged parameter in the URL until you reach the last page.
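That last step could look roughly like the sketch below. It uses only the standard library and works on a list of href strings, such as the ones collected above. last_page_number and page_urls are hypothetical helper names. The sketch also assumes the site accepts a plain integer in paged (for example paged=3), which I have not verified.

```python
import re
from urllib.parse import parse_qs, urlencode, urlsplit


def last_page_number(hrefs):
    """Return the largest page number found in the 'paged' query parameter."""
    pages = []
    for href in hrefs:
        for value in parse_qs(urlsplit(href).query).get('paged', []):
            # In the href from the question the decoded value is
            # "776?s=interpublic", so keep only the leading digits.
            match = re.match(r'\d+', value)
            if match:
                pages.append(int(match.group()))
    # Fall back to a single page when no pagination links were found
    return max(pages, default=1)


def page_urls(base_url, params, last_page):
    """Yield one search URL per page, from 1 to last_page inclusive."""
    for page in range(1, last_page + 1):
        # Assumption: the site accepts a plain integer for 'paged'
        yield base_url + '?' + urlencode({**params, 'paged': page})


hrefs = [
    'https://www.adweek.com?paged=776%3Fs%3Dinterpublic&orderby=date&s=interpublic',
]
last = last_page_number(hrefs)
for url in page_urls('https://www.adweek.com/', {'s': 'interpublic', 'orderby': 'date'}, 3):
    print(url)
```

The loop is capped at 3 pages to keep the demo short. Pass last instead of 3 to walk every page, and call requests.get on each URL as in the code above (ideally with a short pause between requests).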