How to use meta to get data from all the links in a website
As I am new to Python, I need your help. I need to crawl data from all the links on a website. I used meta to go into each link and get its data, but with my code I can get data from only one link.
import scrapy
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
import urlparse
from alibaba.items import AlibabaItem
import mysql.connector
from mysql.connector import conversion
import re

class RedditCrawler(CrawlSpider):
    name = 'baba'
    allowed_domains = ['http://india.alibaba.com/']
    start_urls = ['http://india.alibaba.com/supplier_list.htm?SearchText=automobile+parts&bizType=1']
    custom_settings = {
        'BOT_NAME': 'alibaba',
        'DEPTH_LIMIT': 8,
        'DOWNLOAD_DELAY': 0.5
    }

    def parse(self, response):
        s = Selector(response)
        next_link = s.xpath('//a[@class="next"]/@href').extract_first()
        full_link = urlparse.urljoin('http://india.alibaba.com/', next_link)
        yield self.make_requests_from_url(full_link)
        item = AlibabaItem()
        item['Name'] = s.xpath('//div[@class="corp corp2"]//h2/a/text()').extract()
        item['address'] = s.xpath('//div[@class="value grcolor"]/text()').extract()
        item['Annual_Revenue'] = s.xpath('//div[@class="attrs"]//div[2]//div[@class="value"]//text()').extract()
        item['Main_Markets'] = s.xpath('//div[@class="attrs"]//div[3]//div[@class="value"]//text()').extract()
        item['main_products'] = s.xpath('//div[@class="value ph"]//text()').extract()
        full_link1 = s.xpath('//h2[@class="title ellipsis yrtil"]/a//@href').extract_first()
        absolute_link = urlparse.urljoin('http://india.alibaba.com/', full_link1)
        request_variable = scrapy.Request(absolute_link, callback=self.parse_website, dont_filter=True)
        request_variable.meta['parcel_stuff'] = item
        yield request_variable

    def parse_website(self, response):
        s = Selector(response)
        item = response.meta['parcel_stuff']
        item['Year_Established'] = s.xpath('//table//tr[4]//td//a[@class="message-send mc-click-target"]//text()').extract()
        yield item
You've based your RedditCrawler on CrawlSpider, which already has a parse() of its own. Change your class to RedditCrawler(scrapy.Spider). The docs are here, but the important part is:

Warning

When writing crawl spider rules, avoid using parse as the callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.