
How to crawl a website that has pagination using Scrapy?

I am trying to crawl a website that has pagination. If I click the "next" button at the bottom of the page, new items are generated. My Scrapy program is not able to fetch this dynamic data. Is there a way I can fetch it?

The HTML of the next button looks like this:

<div id="morePaginationID">
    <a href="javascript:void(0);" onclick="lazyPagingNew('db')"></a>
</div>

and my spider is:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

class ExampleSpider(CrawlSpider):

    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/beauty/90?utm_source=viewallbea"]
    rules = (
        Rule(SgmlLinkExtractor(allow=('.*',), restrict_xpaths=('//div[@id="morePaginationID"]',)),
             callback="parse_zero", follow=True),
    )

    def parse_zero(self, response):
        hxs = HtmlXPathSelector(response)
        paths = hxs.select('//div[@id="containerDiv"]/div[@id="loadFilterResults"]/ul[@id="categoryPageListing"]/li')
        for path in paths:
            item = ExampleItem()
            item["dealUrl"] = path.select("figure/figcaption/a/@href").extract()[0]
            # Follow each deal link, carrying the partly-filled item along in meta
            yield Request(item["dealUrl"], callback=self.parselevelone, meta={"item": item})

    def parselevelone(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta["item"]
        item["Title2"] = hxs.select('//div[@class="fullDetail"]/div/figure/figcaption/h2/text()').extract()[0]
        return item

What you need to do is this:

1) Open Firefox

2) Run the FireBug console

3) Go to the search results page

4) Since the results change dynamically instead of loading another page, a JavaScript function is calling another URL (an API) for the next page of results

5) Look for THIS URL in the FireBug console

6) You need to set Scrapy to call the same URL that the JavaScript function is calling. It will most probably return a JSON- or XML-formatted array of results, which is easy to manipulate in Python

7) Most likely it will have a 'pageNo' variable, so iterate through the page numbers and fetch the results! A minimal sketch of this loop follows the list.
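
As a rough illustration of steps 6 and 7, here is a minimal sketch of a spider that requests the hidden API page by page. The endpoint URL, the 'pageNo' query parameter, and the 'deals'/'url'/'title' JSON keys are all assumptions; replace them with whatever the FireBug network tab actually shows:

import json
import scrapy

class ApiPagingSpider(scrapy.Spider):
    name = "example_api"
    allowed_domains = ["example.com"]

    # Hypothetical endpoint: copy the real URL and parameter names from
    # the request FireBug shows when you click the "next" button.
    api_url = "http://example.com/lazyPagingNew?category=beauty&pageNo=%d"

    def start_requests(self):
        yield scrapy.Request(self.api_url % 1, callback=self.parse_page,
                             meta={"pageNo": 1})

    def parse_page(self, response):
        data = json.loads(response.text)      # assumes the API returns JSON
        for deal in data.get("deals", []):    # "deals" key is an assumption
            yield {
                "dealUrl": deal.get("url"),
                "title": deal.get("title"),
            }
        # Keep requesting the next page until the API returns no results
        if data.get("deals"):
            page = response.meta["pageNo"] + 1
            yield scrapy.Request(self.api_url % page, callback=self.parse_page,
                                 meta={"pageNo": page})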

There are two ways you can choose. First, you can capture the HTTP request to find the origin address of the JSON or XML data, then crawl that URL directly. Second, you can use a spider that can execute JavaScript, such as the pyspider project https://github.com/binux/pyspider (a rough sketch follows).
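
For the second route, a minimal pyspider handler might look like the sketch below. fetch_type='js' asks pyspider's PhantomJS fetcher to execute the page's JavaScript (so the lazy-paging code actually runs); the start URL and the CSS selectors are placeholders borrowed from the question, not verified against the real site:

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        # fetch_type='js' renders the page with PhantomJS so that the
        # "next" button's onclick handler can load the extra items
        self.crawl('http://example.com/beauty/90', fetch_type='js',
                   callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Selector assumed from the question's listing markup
        for each in response.doc('ul#categoryPageListing li a').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {"url": response.url, "title": response.doc('title').text()}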
