
Beautiful Soup - Selecting Class has Unexpected Results

I am new to programming and have been learning Python through web scraping. What I am trying to do is capture the line below from the site listed in my URL:

<a class="" href="https://www.adweek.com?paged=776%3Fs%3Dinterpublic&amp;orderby=date&amp;s=interpublic">776</a>, but I cannot seem to get there. It only returns the first line of pagination information and I can't figure out why. Any help would be greatly appreciated.

import requests
from bs4 import BeautifulSoup
url = 'https://www.adweek.com/?s=interpublic&orderby=date'
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')
k = soup.find_all('div', {'class': 'pagination-centered'})

Returns only:

[<div class="pagination-centered"><ul class="pagination">
 <li><span aria-current="page" class="current">1</span></li></ul></div>]

Thanks, Seth

You can get the pagination links with the CSS selector a[href*="paged="], which matches every anchor whose href contains "paged=":

import requests
from bs4 import BeautifulSoup

url = 'https://www.adweek.com/?s=interpublic&orderby=date'
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')

# print text and href
pagination = soup.select('a[href*="paged="]')
for p in pagination:
    print(p.text.strip(), p.get('href'))

"Next" has the same URL as the first link, so you can use a set to keep only the unique hrefs:

pagination = {p['href'] for p in soup.select('a[href*="paged="]')}

You can get the last page number and then iterate by changing the paged parameter in the URL until you reach the last page.
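A minimal sketch of that idea, using only the standard library. It assumes you have already collected the hrefs with the selector above. The helper names last_page and page_urls are made up for illustration, and the second sample href is hypothetical. The paged value in the question's link decodes to "776?s=interpublic", so the sketch keeps only the leading digits. Whether the site accepts paged as an ordinary query parameter in this form is also an assumption.

```python
import re
from urllib.parse import parse_qs, urlencode, urlsplit


def last_page(hrefs):
    """Return the largest page number found in the `paged` query parameter."""
    pages = []
    for href in hrefs:
        for value in parse_qs(urlsplit(href).query).get('paged', []):
            # The value may carry trailing text, so keep the leading digits only.
            match = re.match(r'\d+', value)
            if match:
                pages.append(int(match.group()))
    # Fall back to a single page when no pagination link is present.
    return max(pages, default=1)


def page_urls(base, query, last):
    """Yield one URL per page by setting `paged` from 1 to `last`."""
    for page in range(1, last + 1):
        yield base + '?' + urlencode({**query, 'paged': page})


# Sample hrefs: the first is the link from the question, the second is hypothetical.
hrefs = [
    'https://www.adweek.com?paged=776%3Fs%3Dinterpublic&orderby=date&s=interpublic',
    'https://www.adweek.com?paged=2%3Fs%3Dinterpublic&orderby=date&s=interpublic',
]
last = last_page(hrefs)
urls = page_urls('https://www.adweek.com/', {'s': 'interpublic', 'orderby': 'date'}, last)
```

Each URL from page_urls can then be passed to requests.get and parsed the same way as the first page. With this many pages, pausing between requests is a sensible precaution.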

Page source without JavaScript: [image]
