Best way to follow links with a Scrapy web crawler

I'm trying to write a spider that keeps clicking the next button on a webpage until it can't anymore (or until I add some logic to make it stop). The code below correctly gets the link to the next page but prints it only once. Why isn't it "following" the link that each next button leads to?

class MyprojectSpider(scrapy.Spider):
    name = 'redditbot'
    allowed_domains = ['https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb']
    start_urls = ['https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        next_page = hxs.select('//div[@class="nav-buttons"]//a/@href').extract()
        if next_page:
            yield Request(next_page[1], self.parse)
            print(next_page[1])

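To go to the next page, yield a scrapy.Request for the next-page link with parse as its callback. Also note that allowed_domains should list bare domain names such as reddit.com, not full URLs; with a full URL there, Scrapy is likely to drop the follow-up requests as offsite. Something like the following code: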

import scrapy


class MyprojectSpider(scrapy.Spider):
    name = 'myproject'
    # Domain names only here, not full URLs
    allowed_domains = ['reddit.com']
    start_urls = ['https://www.reddit.com/r/nfl/']

    def parse(self, response):
        # Get your data from each post on the current page
        for post in response.xpath('//div[@class="top-matter"]'):
            title = post.xpath('p[@class="title"]/a/text()').extract_first()
            print(title)

        # Go to the next page: once per page, so this sits outside the loop
        next_page = response.xpath('//span[@class="next-button"]/a/@href').extract_first()
        if next_page:
            # urljoin turns a relative href into an absolute URL
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
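
One difference from the spider in the question is easy to miss: allowed_domains holds a domain name here rather than a full URL. The snippet below is only a rough, standard-library illustration of why that matters. The helper is_on_allowed_domain is hypothetical and is not Scrapy's actual offsite filter, and the next-page URL is made up; it just assumes that such a filter compares the host of each request against the listed entries.

```python
from urllib.parse import urlparse


def is_on_allowed_domain(url, allowed_domains):
    """Simplified stand-in for an offsite check (not Scrapy's real code):
    True if the URL's host equals an allowed entry or is a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# Made-up next-page URL, used only for illustration
next_url = "https://www.reddit.com/r/nfl/?count=50&after=t3_example"

# A bare domain matches the host "www.reddit.com"
domain_entry_matches = is_on_allowed_domain(next_url, ["reddit.com"])

# A full URL can never equal a host name or be a suffix of one
url_entry_matches = is_on_allowed_domain(
    next_url, ["https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb"]
)

print(domain_entry_matches, url_entry_matches)
```

If the real filter behaves along these lines, the start URL is still fetched, but every request yielded for a next page is discarded, which would explain a link that is printed once and never followed.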

Update: The previous code was wrong. It needed to use the absolute URL, and some of the XPath expressions were incorrect. This new version should work.
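
To see what "use the absolute URL" means in practice, here is a small sketch with the standard library's urljoin, which is roughly what response.urljoin does with the page's own URL as the base. The relative href is an assumed example of what a next button might contain, not something taken from the actual page.

```python
from urllib.parse import urljoin

# URL of the page currently being parsed
page_url = "https://www.reddit.com/r/nfl/"

# Assumed examples of what a "next" href might look like
relative_href = "/r/nfl/?count=25&after=t3_7ax8lb"
absolute_href = "https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb"

# A relative href is resolved against the page URL
resolved_relative = urljoin(page_url, relative_href)

# An href that is already absolute is returned unchanged
resolved_absolute = urljoin(page_url, absolute_href)

print(resolved_relative)
print(resolved_absolute)
```

Because the join is harmless on links that are already absolute, it is safe to apply to every extracted href. Scrapy 1.4 and later also provide response.follow, which accepts relative URLs directly.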

Hope it helps!
