How to get all the next page links from a webpage?

I've written a Python script to collect all the links leading to the next pages. However, it only works up to a point. The highest page number is 255. Running my script, I get the first 23 links along with the last-page link, but the links in between (24 to 254) are missing. How can I get all of them? Here is what I've tried:

import requests
from lxml import html

page_link = "https://www.yify-torrent.org/search/1080p/"
b_link = "https://www.yify-torrent.org"

def get_links(main_link):
    links = []
    response = requests.get(main_link).text
    tree = html.fromstring(response)
    for item in tree.cssselect('div.pager a'):
        if item.attrib["href"] not in links:
            links.append(item.attrib["href"])
    for link in links:
        print(b_link + link)

get_links(page_link)

The HTML that contains the next-page links:

<div class="pager"><a href="/search/1080p/" class="current">1</a> <a href="/search/1080p/t-2/">2</a> <a href="/search/1080p/t-3/">3</a> <a href="/search/1080p/t-4/">4</a> <a href="/search/1080p/t-5/">5</a> <a href="/search/1080p/t-6/">6</a> <a href="/search/1080p/t-7/">7</a> <a href="/search/1080p/t-8/">8</a> <a href="/search/1080p/t-9/">9</a> <a href="/search/1080p/t-10/">10</a> <a href="/search/1080p/t-11/">11</a> <a href="/search/1080p/t-12/">12</a> <a href="/search/1080p/t-13/">13</a> <a href="/search/1080p/t-14/">14</a> <a href="/search/1080p/t-15/">15</a> <a href="/search/1080p/t-16/">16</a> <a href="/search/1080p/t-17/">17</a> <a href="/search/1080p/t-18/">18</a> <a href="/search/1080p/t-19/">19</a> <a href="/search/1080p/t-20/">20</a> <a href="/search/1080p/t-21/">21</a> <a href="/search/1080p/t-22/">22</a> <a href="/search/1080p/t-23/">23</a> <a href="/search/1080p/t-2/">Next</a> <a href="/search/1080p/t-255/">Last</a> </div>

The results I'm getting look like this (truncated to the last five links):

https://www.yify-torrent.org/search/1080p/t-20/
https://www.yify-torrent.org/search/1080p/t-21/
https://www.yify-torrent.org/search/1080p/t-22/
https://www.yify-torrent.org/search/1080p/t-23/
https://www.yify-torrent.org/search/1080p/t-255/

The answer provided by @kaze should return all 255 pages, but if you need to get the links dynamically without hardcoding the total number of pages, you might try:

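import requests
from lxml import html

r = requests.get("https://www.yify-torrent.org/search/1080p/")
tree = html.fromstring(r.content)
# The "Last" link's href looks like /search/1080p/t-255/; pull out the 255.
page_number = tree.xpath("//div[@class='pager']/a[.='Last']/@href")[0].split("/")[-2].replace("t-", "")

# Pages are numbered from 1, so start the range at 1 rather than 0.
for page in range(1, int(page_number) + 1):
    requests.get("https://www.yify-torrent.org/search/1080p/t-%s/" % page)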
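
The parsing step can also be separated from the network call, which makes it easy to check against the pager markup quoted in the question. What follows is only a minimal sketch using the standard library: it assumes the "Last" href always ends in t-<number>/ and that page 1 is the bare search URL, as the quoted markup suggests. The helper names last_page_number and page_urls are made up for illustration.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.yify-torrent.org/search/1080p/"


def last_page_number(pager_html):
    """Return the page number found in the 'Last' link, or 1 if there is none."""
    # Assumption: the link looks like <a href="/search/1080p/t-255/">Last</a>.
    match = re.search(r'href="[^"]*/t-(\d+)/"[^>]*>\s*Last\s*<', pager_html)
    return int(match.group(1)) if match else 1


def page_urls(pager_html, base_url=BASE_URL):
    """Build every result-page URL from page 1 up to the last page."""
    last = last_page_number(pager_html)
    # Page 1 is the bare search URL; later pages append t-N/ to it.
    return [base_url] + [urljoin(base_url, "t-%d/" % n) for n in range(2, last + 1)]
```

Passing the text of the first results page (for example requests.get(BASE_URL).text) to page_urls would give the full list without hardcoding 255. A regular expression is brittle against markup changes, so the XPath version above may be the safer choice if lxml is already in use.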

If the link structure isn't inferable, you would have to 'walk the site', but here you might as well generate the links yourself, like so:

for i in range(1,256):
    print('https://www.yify-torrent.org/search/1080p/t-%s/' % i)

Your script looks correct as it is. Looking at the HTML for that page, I see this:

 <a href="/search/1080p/t-2/">2</a> <a href="/search/1080p/t-3/">3</a> <a href="/search/1080p/t-4/">4</a> <a href="/search/1080p/t-5/">5</a> <a href="/search/1080p/t-6/">6</a> <a href="/search/1080p/t-7/">7</a> <a href="/search/1080p/t-8/">8</a> <a href="/search/1080p/t-9/">9</a> <a href="/search/1080p/t-10/">10</a> <a href="/search/1080p/t-11/">11</a> <a href="/search/1080p/t-12/">12</a> <a href="/search/1080p/t-13/">13</a> <a href="/search/1080p/t-14/">14</a> <a href="/search/1080p/t-15/">15</a> <a href="/search/1080p/t-16/">16</a> <a href="/search/1080p/t-17/">17</a> <a href="/search/1080p/t-18/">18</a> <a href="/search/1080p/t-19/">19</a> <a href="/search/1080p/t-20/">20</a> <a href="/search/1080p/t-21/">21</a> <a href="/search/1080p/t-22/">22</a> <a href="/search/1080p/t-23/">23</a> <a href="/search/1080p/t-2/">Next</a> <a href="/search/1080p/t-255/">Last</a> 

The pager on the first page only lists pages 2 to 23, plus Next and Last. It seems t-2 is a pointer to the Next page, whose own pager will contain the rest of the links.
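
Following the pager from page to page could look roughly like the sketch below. It is not taken from any of the answers: the names PagerLinks and walk_pager are hypothetical, it uses only the standard library, and the fetch argument is an assumed callable that takes a URL and returns the page's HTML as text (for example lambda url: requests.get(url).text).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class PagerLinks(HTMLParser):
    """Collect the href of every <a> found inside <div class="pager">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # greater than 0 while inside the pager div
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Enter the pager div, or track a div nested inside it.
            if self.depth or "pager" in (attrs.get("class") or "").split():
                self.depth += 1
        elif tag == "a" and self.depth and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


def walk_pager(start_url, fetch, max_fetches=500):
    """Visit each page once and return every distinct pager URL discovered."""
    found = [start_url]
    known = {start_url}
    queue = deque([start_url])
    fetches = 0
    while queue and fetches < max_fetches:   # cap guards against endless pagers
        url = queue.popleft()
        fetches += 1
        parser = PagerLinks()
        parser.feed(fetch(url))
        for href in parser.hrefs:
            absolute = urljoin(url, href)     # resolve /search/... against the host
            if absolute not in known:
                known.add(absolute)
                found.append(absolute)
                queue.append(absolute)
    return found
```

Note that this requests every results page just to read its pager, so for a site whose URLs follow an obvious pattern it is much slower than generating the links directly.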

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please indicate the site URL or the original address.
