
Getting max page number when scraping a website with Python

I am very new to Python and have to scrape a website for some data for a university course:

Xrel

I am able to get the information I need. The problem is that I need it for every entry (page, month, year).

The number of pages differs for every month. Is there any way to extract the maximum page number so I can store it and use it in a loop?

I would appreciate any help. Thanks!

For loops are nice, but you can't always use them. In this case I would just repeatedly follow the link in the 'next page' button until there is no such button. Something like this:

url = <first page>
while True:
    # extract data
    if <there is a next page button>:
        url = <href of the button>
    else:
        break

This will get all your pages, yielding a BeautifulSoup object for each. The link to the next page is in the anchor tag with the class forward:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin  # Python 3; on Python 2: from urlparse import urljoin


def get_pages(base, url):
    # Yield a BeautifulSoup object for every page, following the
    # "next page" link (an anchor with class "forward") until it disappears.
    while True:
        soup = BeautifulSoup(requests.get(url).content, "html.parser")
        yield soup
        next_page = soup.select_one("a.forward")
        if next_page is None:
            break
        url = urljoin(base, next_page["href"])



for soup in get_pages("https://www.xrel.to/", "https://www.xrel.to/games-release-list.html?archive=2016-01"):
    print(soup)
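If you do want the maximum page number up front, as the question asks, you can parse it out of the pagination links on the first page instead of following them one by one. The markup below is an assumption for illustration (the real xrel.to pagination may use different attribute names or URL parameters); the idea is just to collect every page number that appears in a pagination href and take the largest:

```python
import re

# Hypothetical pagination markup, standing in for the first page's HTML.
# The "page=" query parameter is an assumption about the site's URL scheme.
sample_html = """
<div class="pager">
  <a href="?archive=2016-01&page=1">1</a>
  <a href="?archive=2016-01&page=2">2</a>
  <a href="?archive=2016-01&page=3">3</a>
  <a class="forward" href="?archive=2016-01&page=2">next</a>
</div>
"""

def max_page_number(html):
    """Return the highest page number found in pagination hrefs (1 if none)."""
    numbers = [int(n) for n in re.findall(r"[?&]page=(\d+)", html)]
    return max(numbers) if numbers else 1

print(max_page_number(sample_html))  # -> 3
```

You would feed this the HTML of the first page for each month (e.g. `requests.get(url).text`) and then loop `for page in range(1, max_page + 1)`. Note that this only works if the pagination actually lists or links the last page; the next-button approach above is more robust when it doesn't.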
