
Scrapy crawls only the 1st page

Hi, I am working on a project using Scrapy in which I need to scrape business details from a business directory: http://directory.thesun.co.uk/find/uk/computer-repair
The problem I am facing is that when I crawl the site, my crawler fetches the details from only the 1st page, whereas I need the details from the remaining 9 pages as well, that is, all 10 pages. Below are my spider code, items.py and settings.py. Please look at my code and help me solve this.

Spider code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from project2.items import Project2Item

class ProjectSpider(BaseSpider):
    name = "project2spider"
    allowed_domains = ["http://directory.thesun.co.uk/"]
    start_urls = [
        "http://directory.thesun.co.uk/find/uk/computer-repair"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="abTbl "]')
        items = []
        for site in sites:
            item = Project2Item()
            item['Catogory'] = site.select('span[@class="icListBusType"]/text()').extract()
            item['Bussiness_name'] = site.select('a/@title').extract()
            item['Description'] = site.select('span[last()]/text()').extract()
            item['Number'] = site.select('span[@class="searchInfoLabel"]/span/@id').extract()
            item['Web_url'] = site.select('span[@class="searchInfoLabel"]/a/@href').extract()
            item['adress_name'] = site.select('span[@class="searchInfoLabel"]/span/text()').extract()
            item['Photo_name'] = site.select('img/@alt').extract()
            item['Photo_path'] = site.select('img/@src').extract()
            items.append(item)
        return items

My items.py code is as follows:

from scrapy.item import Item, Field

class Project2Item(Item):
    Catogory = Field()
    Bussiness_name = Field()
    Description = Field()
    Number = Field()
    Web_url = Field()
    adress_name = Field()
    Photo_name = Field()
    Photo_path = Field()

My settings.py is:

BOT_NAME = 'project2'

SPIDER_MODULES = ['project2.spiders']
NEWSPIDER_MODULE = 'project2.spiders'

Please help me extract the details from the other pages too.

When fetching the description with .select('span/text()'), you are selecting text from ALL spans in //div[@class="abTbl "]. To extract only the last span, you can use the XPath 'span[last()]/text()'.
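
To see the difference between the two paths without running a crawl, here is a minimal sketch using Python's standard-library ElementTree instead of Scrapy's selector. The markup in SNIPPET is a hypothetical stand-in for one listing block, not the real page. ElementTree supports only a subset of XPath and has no text() step, so the sketch reads .text from each matched element.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing markup; the real page structure may differ.
SNIPPET = (
    '<div class="abTbl ">'
    '<span class="icListBusType">Computer Repair</span>'
    '<span class="searchInfoLabel">Phone</span>'
    '<span>Friendly local repair shop</span>'
    '</div>'
)

def span_texts(fragment, path):
    """Return the text of every element matched by `path` under the root."""
    root = ET.fromstring(fragment)
    return [element.text for element in root.findall(path)]

# "span" matches every direct-child span of the div.
all_spans = span_texts(SNIPPET, "span")
# "span[last()]" matches only the final span child.
last_span = span_texts(SNIPPET, "span[last()]")

print(all_spans)
print(last_span)
```

The first call returns the text of all three spans, while the second returns only the description, which is the behaviour the 'span[last()]/text()' path gives inside the spider.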

By the way, http://www.w3schools.com/xpath/xpath_syntax.asp should help you with XPath expressions.
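
As for only the 1st page being crawled: the parse method in the question returns items but never requests another page, and start_urls lists a single URL, so Scrapy has nothing further to fetch. Below is a rough standard-library sketch of the missing step, finding a "next page" link and resolving it to an absolute URL. The rel="next" attribute, the /page/2 path and the names NextLinkFinder and next_page_url are all assumptions made for illustration; the real site's pagination markup has to be checked in the page source.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class NextLinkFinder(HTMLParser):
    """Record the href of the first <a> whose rel attribute is "next"."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and self.href is None and attributes.get("rel") == "next":
            self.href = attributes.get("href")

def next_page_url(html, base_url):
    """Return the absolute URL of the next results page, or None if there is none."""
    finder = NextLinkFinder()
    finder.feed(html)
    return urljoin(base_url, finder.href) if finder.href else None

# Hypothetical pagination markup; the real link may look quite different.
PAGE = '<div><a href="/find/uk/computer-repair/page/2" rel="next">Next</a></div>'
BASE = "http://directory.thesun.co.uk/find/uk/computer-repair"

print(next_page_url(PAGE, BASE))
print(next_page_url("<div>no more pages</div>", BASE))
```

Inside the spider, the equivalent step would be to yield each item and then yield a Scrapy Request for the next-page URL with callback=self.parse, so the same method handles every page. Also note that allowed_domains is meant to hold bare domain names such as "directory.thesun.co.uk"; with a full URL in that list, follow-up requests may be filtered as off-site.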
