How to use meta to get data from all the links in a website

Question

As I am new to python I need your help . I need to crawl data from all the links in a website. I used meta to go into the link and to get data. When i use my code i can get from only one link.

import scrapy
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
import urlparse
from alibaba.items import AlibabaItem
import mysql.connector
from mysql.connector import conversion
import re




class RedditCrawler(CrawlSpider):
    name = 'baba'
    allowed_domains = ['http://india.alibaba.com/']
    start_urls = ['http://india.alibaba.com/supplier_list.htm?SearchText=automobile+parts&bizType=1']
    custom_settings = {
        'BOT_NAME': 'alibaba',
        'DEPTH_LIMIT': 8,
        'DOWNLOAD_DELAY': 0.5
        }


    def parse(self, response):
        s = Selector(response)
        next_link = s.xpath('//a[@class="next"]/@href').extract_first()
        full_link = urlparse.urljoin('http://india.alibaba.com/',next_link)
        yield self.make_requests_from_url(full_link)
        item=AlibabaItem()
        item['Name']=s.xpath('//div[@class="corp corp2"]//h2/a/text()').extract()
        item['address']=s.xpath('//div[@class="value grcolor"]/text()').extract()
        item['Annual_Revenue']=s.xpath('//div[@class="attrs"]//div[2]//div[@class="value"]//text()').extract()
        item['Main_Markets']=s.xpath('//div[@class="attrs"]//div[3]//div[@class="value"]//text()').extract()
        item['main_products']=s.xpath('//div[@class="value ph"]//text()').extract()

        full_link1=s.xpath('//h2[@class="title ellipsis yrtil"]/a//@href').extract_first()
        absolute_link = urlparse.urljoin('http://india.alibaba.com/',full_link1)
        request_variable = scrapy.Request(absolute_link,callback=self.parse_website,dont_filter=True)
        request_variable.meta['parcel_stuff'] = item
        yield request_variable

    def parse_website(self,response):
        s = Selector(response)

        item = response.meta['parcel_stuff']
        item['Year_Established']=s.xpath('//table//tr[4]//td//a[@class="message-send mc-click-target"]//text()').extract()

        yield item

Answer 1

You've based your RedditCrawler on CrawlSpider which already has a parse() of its own. Change your class to RedditCrawler(scrapy.Spider). The docs are here , but the important part is

Warning

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

How to use meta to get data from all the links in a website

Question

1 answers

solution1
0 2016-01-19 16:05:26

How to use meta to get data from all the links in a website

Question

1 answers

solution1 0 2016-01-19 16:05:26

solution1
0 2016-01-19 16:05:26