Go to next page on showthread.php with scrapy

I'm new to Scrapy. For about four days I've been stuck on going to the next page when scraping showthread.php (a forum based on vBulletin).

My target: http://forum.femaledaily.com/showthread.php?359-Hair-Smoothing

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from femaledaily.items import FemaledailyItem

class Femaledaily(scrapy.Spider):
    name = "femaledaily"
    allowed_domains = ["femaledaily.com"]
    start_urls = [
        "http://forum.femaledaily.com/forumdisplay.php?136-Hair-Care",
        "http://forum.femaledaily.com/forumdisplay.php?136-Hair-Care/page2",
        "http://forum.femaledaily.com/forumdisplay.php?136-Hair-Care/page3",
        "http://forum.femaledaily.com/forumdisplay.php?136-Hair-Care/page4",
    ]

    def parse(self, response):
        for thd in response.css("tbody > tr "):
            print("==========NEW THREAD======")
            url = thd.xpath('.//div[@class="threadlist-title"]/a/@href').extract()
            url[0] = "http://forum.femaledaily.com/"+url[0]
            print(url[0])
            yield scrapy.Request(url[0], callback=self.parse_thread)

    def parse_thread(self, response):
        for page in response.xpath('//ol[@id="posts"]/li'):
            item = FemaledailyItem()
            item['thread_title'] = response.selector.xpath('//span[@class="threadtitle"]/a/text()').extract()
            # item['thread_starter'] = response.selector.xpath('//div[@class="username_container"]/a/text()').extract_first()
            post_creator = page.xpath('.//div[@class="username_container"]/a/text()').extract()

            if not post_creator:
                item['post_creator'] = page.xpath('.//div[@class="username_container"]/a/span/text()').extract()
            else:
                item['post_creator'] = post_creator

            item['post_content'] = ""

            cot = page.xpath(".//blockquote[@class='postcontent restore ']/text()").extract()
            for ct in cot:
                item['post_content'] += ct.replace('\t','').replace('\n','')

            yield item

I'm able to get the first 10 posts of every thread, but I'm confused about how to go to the next page. Any ideas?

Here is your code with a slight change so that it paginates properly:

import scrapy

from femaledaily.items import FemaledailyItem

class Femaledaily(scrapy.Spider):
    name = "femaledaily"
    allowed_domains = ["femaledaily.com"]
    BASE_URL = "http://forum.femaledaily.com/"
    start_urls = [
        "http://forum.femaledaily.com/forumdisplay.php?136-Hair-Care",
        "http://forum.femaledaily.com/forumdisplay.php?136-Hair-Care/page2",
        "http://forum.femaledaily.com/forumdisplay.php?136-Hair-Care/page3",
        "http://forum.femaledaily.com/forumdisplay.php?136-Hair-Care/page4",
    ]

    def parse(self, response):
        for thd in response.css("tbody > tr "):
            print("==========NEW THREAD======")
            url = thd.xpath('.//div[@class="threadlist-title"]/a/@href').extract()
            url = self.BASE_URL + url[0]
            yield scrapy.Request(url, callback=self.parse_thread)

        # pagination: follow the "next" link of the thread list, if present
        next_page = response.xpath('//li[@class="prev_next"]/a[@rel="next"]/@href').extract()
        if next_page:
            # Request is not imported by name, so reference it through the scrapy module
            yield scrapy.Request(self.BASE_URL + next_page[0], callback=self.parse)

    def parse_thread(self, response):
        for page in response.xpath('//ol[@id="posts"]/li'):
            item = FemaledailyItem()
            item['thread_title'] = response.selector.xpath('//span[@class="threadtitle"]/a/text()').extract()
            # item['thread_starter'] = response.selector.xpath('//div[@class="username_container"]/a/text()').extract_first()
            post_creator = page.xpath('.//div[@class="username_container"]/a/text()').extract()

            if not post_creator:
                item['post_creator'] = page.xpath('.//div[@class="username_container"]/a/span/text()').extract()
            else:
                item['post_creator'] = post_creator

            item['post_content'] = ""

            cot = page.xpath(".//blockquote[@class='postcontent restore ']/text()").extract()
            for ct in cot:
                item['post_content'] += ct.replace('\t','').replace('\n','')

            yield item

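        # pagination: follow the "next" link of the thread, if present
        next_page = response.xpath('//li[@class="prev_next"]/a[@rel="next"]/@href').extract()
        if next_page:
            # Request is not imported by name, so reference it through the scrapy module
            yield scrapy.Request(self.BASE_URL + next_page[0], callback=self.parse_thread)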

Here we first extract the next page's link (i.e., the single forward arrow), issue a request to that next-page URL, and set the callback to the same function the request was made from. When the crawl reaches the last page, the next-page link is no longer present, so no further request is made and the crawl stops.
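The stop condition can be illustrated outside Scrapy. The following is only a rough sketch in standard-library Python, not the spider itself. The host forum.example.com, the in-memory PAGES dictionary, and the names NextLinkParser, find_next_url and crawl are all hypothetical. A simple loop stands in for Scrapy's callback scheduling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Record the href of the first <a rel="next"> element seen."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("rel") == "next" and self.next_href is None:
            self.next_href = attrs.get("href")


def find_next_url(page_url, html):
    """Return the absolute URL of the next page, or None on the last page."""
    parser = NextLinkParser()
    parser.feed(html)
    if parser.next_href is None:
        return None
    # urljoin resolves the relative href against the page it was found on.
    return urljoin(page_url, parser.next_href)


def crawl(start_url, fetch):
    """Visit pages in order, following rel="next" until it is absent."""
    visited = []
    url = start_url
    while url is not None:
        visited.append(url)
        url = find_next_url(url, fetch(url))
    return visited


# Hypothetical in-memory "site" standing in for real HTTP responses.
BASE = "http://forum.example.com/"
PAGES = {
    BASE + "showthread.php?1-demo":
        '<li class="prev_next"><a rel="next" href="showthread.php?1-demo/page2">&gt;</a></li>',
    BASE + "showthread.php?1-demo/page2":
        '<li class="prev_next"><a rel="next" href="showthread.php?1-demo/page3">&gt;</a></li>',
    BASE + "showthread.php?1-demo/page3":
        "<p>last page, no next link</p>",
}

visited = crawl(BASE + "showthread.php?1-demo", PAGES.__getitem__)
```

Inside a Scrapy callback, response.urljoin(next_page[0]) performs the same kind of resolution against the current page's URL. It could be used instead of concatenating BASE_URL, assuming the forum's hrefs are relative to the page they appear on.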
