
Scrapy, recursive crawling with different XPathSelector

Good evening and thanks for your help.

I am digging into Scrapy; I need to get information from a website and recreate the same tree structure as the site. Example:

books [
    python [
        first [
            title = 'Title'
            author = 'John Doe'
            price = '200'
        ]

        second [
            title = 'Other Title'
            author = 'Mary Doe'
            price = '100'
        ]
    ]

    php [
        first [
            title = 'PhpTitle'
            author = 'John Smith'
            price = '100'
        ]

        second [
            title = 'Php Other Title'
            author = 'Mary Smith'
            price = '300'
        ]
    ]
]

Starting from the tutorial, I have my basic spider working:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from pippo.items import PippoItem

class PippoSpider(BaseSpider):
    name = "pippo"
    allowed_domains = ["www.books.net"]
    start_urls = [
        "http://www.books.net/index.php"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # one <li> per subject in the left-hand navigation pane
        sites = hxs.select('//div[@id="28008_LeftPane"]/div/ul/li')
        items = []
        for site in sites:
            item = PippoItem()
            item['subject'] = site.select('a/b/text()').extract()
            item['link'] = site.select('a/@href').extract()
            items.append(item)
        return items
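
(This assumes a PippoItem defined roughly along these lines in pippo/items.py — a minimal sketch matching the fields the spider uses, not the actual project file:)

from scrapy.item import Item, Field

class PippoItem(Item):
    # fields used by the basic spider; declare more here as the
    # deeper levels are scraped (title, author, price, ...)
    subject = Field()
    link = Field()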

My problem is that each level of my structure sits one level deeper in the site: at the top level I get the book subjects, and then I need to crawl the corresponding item['link'] to get the remaining fields. But on those next pages I will need a different HtmlXPathSelector expression to correctly extract my data, and so on down to the bottom of the structure.

Could you please point me in the right direction? Thank you.

You will need to make the Requests for the links manually (also see CrawlSpider):

from urlparse import urljoin

from scrapy.http import Request
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from pippo.items import PippoItem

class PippoSpider(BaseSpider):
    name = "pippo"
    allowed_domains = ["www.books.net"]
    start_urls = ["http://www.books.net/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@id="28008_LeftPane"]/div/ul/li')

        for site in sites:
            item = PippoItem()
            item['subject'] = site.select('.//text()').extract()
            item['link'] = site.select('.//a/@href').extract()
            link = item['link'][0] if len(item['link']) else None
            if link:
                # follow the link one level deeper; the item rides along
                # in request.meta so the callback can keep filling it in
                yield Request(urljoin(response.url, link),
                    callback=self.parse_link,
                    # on download error, still emit the partial item
                    errback=lambda _: item,
                    meta=dict(item=item),
                    )
            else:
                yield item

    def parse_link(self, response):
        item = response.meta.get('item')
        item['alsothis'] = 'more data'
        return item
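
If every level of the tree needs its own XPath, the same pattern can be chained: each callback uses its own selector expressions and either yields a finished item or yields another Request carrying the item in meta. Here is a rough sketch that expands the parse_link stub above into a real second and third level — the XPaths and the title/author/price field names are placeholders for your actual site, not taken from it:

    def parse_link(self, response):
        # second level: one block per book on the subject page
        hxs = HtmlXPathSelector(response)
        subject_item = response.meta['item']
        for book in hxs.select('//div[@class="book"]'):  # placeholder XPath
            item = PippoItem()
            item['subject'] = subject_item['subject']
            item['title'] = book.select('.//h2/text()').extract()
            link = book.select('.//a/@href').extract()
            if link:
                # third level: follow the book page with its own callback
                yield Request(urljoin(response.url, link[0]),
                    callback=self.parse_book,
                    meta=dict(item=item),
                    )
            else:
                yield item

    def parse_book(self, response):
        # deepest level: fill in the last fields and finish the item
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        item['author'] = hxs.select('//span[@class="author"]/text()').extract()
        item['price'] = hxs.select('//span[@class="price"]/text()').extract()
        return item

Each level only ever sees the selector logic it needs, and the item accumulates fields as it travels down through request.meta, so you can nest this as deep as the site's structure goes.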
