
How to use meta to get data from all the links in a website

As I am new to Python, I need your help. I need to crawl data from all the links in a website. I used meta to follow a link and get data from it, but when I run my code I can only get data from one link.

import scrapy
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
import urlparse
from alibaba.items import AlibabaItem
import mysql.connector
from mysql.connector import conversion
import re


class RedditCrawler(CrawlSpider):
    name = 'baba'
    allowed_domains = ['http://india.alibaba.com/']
    start_urls = ['http://india.alibaba.com/supplier_list.htm?SearchText=automobile+parts&bizType=1']
    custom_settings = {
        'BOT_NAME': 'alibaba',
        'DEPTH_LIMIT': 8,
        'DOWNLOAD_DELAY': 0.5
        }


    def parse(self, response):
        s = Selector(response)
        next_link = s.xpath('//a[@class="next"]/@href').extract_first()
        full_link = urlparse.urljoin('http://india.alibaba.com/',next_link)
        yield self.make_requests_from_url(full_link)
        item=AlibabaItem()
        item['Name']=s.xpath('//div[@class="corp corp2"]//h2/a/text()').extract()
        item['address']=s.xpath('//div[@class="value grcolor"]/text()').extract()
        item['Annual_Revenue']=s.xpath('//div[@class="attrs"]//div[2]//div[@class="value"]//text()').extract()
        item['Main_Markets']=s.xpath('//div[@class="attrs"]//div[3]//div[@class="value"]//text()').extract()
        item['main_products']=s.xpath('//div[@class="value ph"]//text()').extract()

        full_link1=s.xpath('//h2[@class="title ellipsis yrtil"]/a//@href').extract_first()
        absolute_link = urlparse.urljoin('http://india.alibaba.com/',full_link1)
        request_variable = scrapy.Request(absolute_link,callback=self.parse_website,dont_filter=True)
        request_variable.meta['parcel_stuff'] = item
        yield request_variable

    def parse_website(self,response):
        s = Selector(response)

        item = response.meta['parcel_stuff']
        item['Year_Established']=s.xpath('//table//tr[4]//td//a[@class="message-send mc-click-target"]//text()').extract()

        yield item

You've based your RedditCrawler on CrawlSpider, which already has a parse() of its own. Change your class to RedditCrawler(scrapy.Spider). The docs are here, but the important part is:

Warning

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
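
For reference, here is a minimal sketch of the spider with that change applied. The XPaths are copied verbatim from the question and may no longer match the live site. Beyond the base-class fix, this sketch also swaps urlparse.urljoin and make_requests_from_url (deprecated in current Scrapy versions) for response.urljoin and a plain scrapy.Request, and trims allowed_domains to a bare domain name, since Scrapy expects domains there rather than URLs:

import scrapy

from alibaba.items import AlibabaItem


class RedditCrawler(scrapy.Spider):  # scrapy.Spider, so overriding parse() is safe
    name = 'baba'
    allowed_domains = ['india.alibaba.com']  # domain names only, not full URLs
    start_urls = ['http://india.alibaba.com/supplier_list.htm?SearchText=automobile+parts&bizType=1']
    custom_settings = {
        'BOT_NAME': 'alibaba',
        'DEPTH_LIMIT': 8,
        'DOWNLOAD_DELAY': 0.5,
    }

    def parse(self, response):
        # Follow pagination; response.urljoin resolves the relative href
        next_link = response.xpath('//a[@class="next"]/@href').extract_first()
        if next_link:
            yield scrapy.Request(response.urljoin(next_link), callback=self.parse)

        item = AlibabaItem()
        item['Name'] = response.xpath('//div[@class="corp corp2"]//h2/a/text()').extract()
        item['address'] = response.xpath('//div[@class="value grcolor"]/text()').extract()
        item['Annual_Revenue'] = response.xpath('//div[@class="attrs"]//div[2]//div[@class="value"]//text()').extract()
        item['Main_Markets'] = response.xpath('//div[@class="attrs"]//div[3]//div[@class="value"]//text()').extract()
        item['main_products'] = response.xpath('//div[@class="value ph"]//text()').extract()

        # Attach the half-filled item to the request so the next callback can finish it
        supplier_link = response.xpath('//h2[@class="title ellipsis yrtil"]/a//@href').extract_first()
        if supplier_link:
            request = scrapy.Request(response.urljoin(supplier_link),
                                     callback=self.parse_website, dont_filter=True)
            request.meta['parcel_stuff'] = item
            yield request

    def parse_website(self, response):
        # Retrieve the item that parse() attached via request.meta
        item = response.meta['parcel_stuff']
        item['Year_Established'] = response.xpath('//table//tr[4]//td//a[@class="message-send mc-click-target"]//text()').extract()
        yield item

Run it with scrapy crawl baba from the project directory. The meta pattern itself is unchanged: parse() attaches the half-filled item to request.meta, and parse_website() retrieves it from response.meta before yielding the completed item.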
