
Scrapy crawls only the 1st page and not the rest

Hi, I am working on a project using Scrapy in which I need to scrape business details from the business directory http://directory.thesun.co.uk/find/uk/computer-repair

The problem I am facing is that when I try to crawl the site, my crawler fetches the details of only the 1st page, whereas I also need to fetch the details of the remaining 9 pages; that is, all 10 pages. Below are my spider code, items.py and settings.py. Please look at my code and help me solve it.

Spider code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from project2.items import Project2Item

class ProjectSpider(BaseSpider):
    name = "project2spider"
    allowed_domains = ["http://directory.thesun.co.uk/"]
    start_urls = [
        "http://directory.thesun.co.uk/find/uk/computer-repair"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="abTbl "]')
        items = []
        for site in sites:
            item = Project2Item()
            item['Catogory'] = site.select('span[@class="icListBusType"]/text()').extract()
            item['Bussiness_name'] = site.select('a/@title').extract()
            item['Description'] = site.select('span[last()]/text()').extract()
            item['Number'] = site.select('span[@class="searchInfoLabel"]/span/@id').extract()
            item['Web_url'] = site.select('span[@class="searchInfoLabel"]/a/@href').extract()
            item['adress_name'] = site.select('span[@class="searchInfoLabel"]/span/text()').extract()
            item['Photo_name'] = site.select('img/@alt').extract()
            item['Photo_path'] = site.select('img/@src').extract()
            items.append(item)
        return items

My items.py code is as follows:

from scrapy.item import Item, Field

class Project2Item(Item):
    Catogory = Field()
    Bussiness_name = Field()
    Description = Field()
    Number = Field()
    Web_url = Field()
    adress_name = Field()
    Photo_name = Field()
    Photo_path = Field()

My settings.py is:

BOT_NAME = 'project2'

SPIDER_MODULES = ['project2.spiders']
NEWSPIDER_MODULE = 'project2.spiders'

Please help me extract the details from the other pages too...

If you check the paging links, they look like this:

http://directory.thesun.co.uk/find/uk/computer-repair/page/2
http://directory.thesun.co.uk/find/uk/computer-repair/page/3
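Given that URL scheme, the full set of page URLs can be generated up front. A minimal sketch (the page count of 10 comes from the question; the helper name is hypothetical):

```python
def page_urls(base, last_page):
    """Build the directory's paging URLs: the bare base URL is page 1,
    and pages 2..last_page append '/page/N'."""
    urls = [base]
    for n in range(2, last_page + 1):
        urls.append("%s/page/%d" % (base, n))
    return urls

# All 10 listing pages for the computer-repair category
urls = page_urls("http://directory.thesun.co.uk/find/uk/computer-repair", 10)
```

These URLs could be fed straight into `start_urls`, at the cost of hard-coding the page count.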

You could loop over the pages using urllib2 with a page variable:

import urllib2
page = 2  # page number to fetch
response = urllib2.urlopen('http://directory.thesun.co.uk/find/uk/computer-repair/page/' + str(page))
html = response.read()

and scrape the html.

Following is working code. Paging through the results should be handled by studying the website's paging structure and applying it accordingly. In this case, the site uses "/page/x", where x is the page number.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from project2spider.items import Project2Item
from scrapy.http import Request

class ProjectSpider(BaseSpider):
    name = "project2spider"
    allowed_domains = ["http://directory.thesun.co.uk"]
    current_page_no = 1 
    start_urls = [ 
        "http://directory.thesun.co.uk/find/uk/computer-repair"
    ]   

    def get_next_url(self, fired_url):
        if '/page/' in fired_url:
            url, page_no = fired_url.rsplit('/page/', 1)
        else:
            if self.current_page_no != 1:
                #end of scroll
                return 
        self.current_page_no += 1
        return "http://directory.thesun.co.uk/find/uk/computer-repair/page/%s" % self.current_page_no

    def parse(self, response):
        fired_url = response.url
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="abTbl "]')
        for site in sites:
            item = Project2Item()
            item['Catogory'] = site.select('span[@class="icListBusType"]/text()').extract()
            item['Bussiness_name'] = site.select('a/@title').extract()
            item['Description'] = site.select('span[last()]/text()').extract()
            item['Number'] = site.select('span[@class="searchInfoLabel"]/span/@id').extract()
            item['Web_url'] = site.select('span[@class="searchInfoLabel"]/a/@href').extract()
            item['adress_name'] = site.select('span[@class="searchInfoLabel"]/span/text()').extract()
            item['Photo_name'] = site.select('img/@alt').extract()
            item['Photo_path'] = site.select('img/@src').extract()
            yield item
        next_url = self.get_next_url(fired_url)
        if next_url:
            yield Request(next_url, self.parse, dont_filter=True)
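Note that `get_next_url` above splits out `page_no` but never uses it, so the spider has no real stopping condition beyond the site eventually serving empty pages. A variant with an explicit last-page cutoff can be sketched as follows (the cutoff of 10 comes from the question; the function name is hypothetical):

```python
def next_page_url(current_url, base_url, last_page=10):
    """Return the URL of the page after current_url, or None at the end.
    Assumes the site's '/page/N' scheme described above."""
    if '/page/' in current_url:
        page_no = int(current_url.rsplit('/page/', 1)[1])
    else:
        page_no = 1  # the bare base URL is page 1
    if page_no >= last_page:
        return None  # reached the last page: stop crawling
    return "%s/page/%d" % (base_url, page_no + 1)
```

In `parse()` this would replace the `get_next_url` call: yield a `Request` only when `next_page_url(response.url, ...)` is not `None`.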

I tried the code that @nizam.sp posted and it only displays 2 records: 1 record (the last one) from the main page and 1 record (a random one) from the second page, and then it ends.
