
How to use meta to get data from all the links in a website

As I am new to Python, I need your help. I need to crawl data from all the links in a website. I used meta to follow a link and get data from it, but when I run my code I can only get data from one link.

import scrapy
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
import urlparse
from alibaba.items import AlibabaItem
import mysql.connector
from mysql.connector import conversion
import re


class RedditCrawler(CrawlSpider):
    name = 'baba'
    allowed_domains = ['http://india.alibaba.com/']
    start_urls = ['http://india.alibaba.com/supplier_list.htm?SearchText=automobile+parts&bizType=1']
    custom_settings = {
        'BOT_NAME': 'alibaba',
        'DEPTH_LIMIT': 8,
        'DOWNLOAD_DELAY': 0.5
        }


    def parse(self, response):
        s = Selector(response)
        next_link = s.xpath('//a[@class="next"]/@href').extract_first()
        full_link = urlparse.urljoin('http://india.alibaba.com/',next_link)
        yield self.make_requests_from_url(full_link)
        item=AlibabaItem()
        item['Name']=s.xpath('//div[@class="corp corp2"]//h2/a/text()').extract()
        item['address']=s.xpath('//div[@class="value grcolor"]/text()').extract()
        item['Annual_Revenue']=s.xpath('//div[@class="attrs"]//div[2]//div[@class="value"]//text()').extract()
        item['Main_Markets']=s.xpath('//div[@class="attrs"]//div[3]//div[@class="value"]//text()').extract()
        item['main_products']=s.xpath('//div[@class="value ph"]//text()').extract()

        full_link1=s.xpath('//h2[@class="title ellipsis yrtil"]/a//@href').extract_first()
        absolute_link = urlparse.urljoin('http://india.alibaba.com/',full_link1)
        request_variable = scrapy.Request(absolute_link,callback=self.parse_website,dont_filter=True)
        request_variable.meta['parcel_stuff'] = item
        yield request_variable

    def parse_website(self,response):
        s = Selector(response)

        item = response.meta['parcel_stuff']
        item['Year_Established']=s.xpath('//table//tr[4]//td//a[@class="message-send mc-click-target"]//text()').extract()

        yield item

You've based your RedditCrawler on CrawlSpider, which already has a parse() of its own. Change your class to RedditCrawler(scrapy.Spider). The docs are here, but the important part is:

Warning

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
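
For reference, here is a minimal sketch of the spider with that change applied. The XPaths are copied verbatim from the question and may no longer match the live site. Beyond the base-class fix, this sketch also swaps urlparse.urljoin and make_requests_from_url (deprecated in current Scrapy versions) for response.urljoin and a plain scrapy.Request, and trims allowed_domains to a bare domain name, since Scrapy expects domains there rather than URLs:

import scrapy

from alibaba.items import AlibabaItem


class RedditCrawler(scrapy.Spider):  # scrapy.Spider, so overriding parse() is safe
    name = 'baba'
    allowed_domains = ['india.alibaba.com']  # domain names only, not full URLs
    start_urls = ['http://india.alibaba.com/supplier_list.htm?SearchText=automobile+parts&bizType=1']
    custom_settings = {
        'BOT_NAME': 'alibaba',
        'DEPTH_LIMIT': 8,
        'DOWNLOAD_DELAY': 0.5,
    }

    def parse(self, response):
        # Follow pagination; response.urljoin resolves the relative href
        next_link = response.xpath('//a[@class="next"]/@href').extract_first()
        if next_link:
            yield scrapy.Request(response.urljoin(next_link), callback=self.parse)

        item = AlibabaItem()
        item['Name'] = response.xpath('//div[@class="corp corp2"]//h2/a/text()').extract()
        item['address'] = response.xpath('//div[@class="value grcolor"]/text()').extract()
        item['Annual_Revenue'] = response.xpath('//div[@class="attrs"]//div[2]//div[@class="value"]//text()').extract()
        item['Main_Markets'] = response.xpath('//div[@class="attrs"]//div[3]//div[@class="value"]//text()').extract()
        item['main_products'] = response.xpath('//div[@class="value ph"]//text()').extract()

        # Attach the half-filled item to the request so the next callback can finish it
        supplier_link = response.xpath('//h2[@class="title ellipsis yrtil"]/a//@href').extract_first()
        if supplier_link:
            request = scrapy.Request(response.urljoin(supplier_link),
                                     callback=self.parse_website, dont_filter=True)
            request.meta['parcel_stuff'] = item
            yield request

    def parse_website(self, response):
        # Retrieve the item that parse() attached via request.meta
        item = response.meta['parcel_stuff']
        item['Year_Established'] = response.xpath('//table//tr[4]//td//a[@class="message-send mc-click-target"]//text()').extract()
        yield item

Run it with scrapy crawl baba from the project directory. The meta pattern itself is unchanged: parse() attaches the half-filled item to request.meta, and parse_website() retrieves it from response.meta before yielding the completed item.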
