
How to execute a specific spider from the Pipeline without activating it again

Introduction

The website I'm scraping has two URLs:

  • /top lists the top players
  • /player/{name} shows the info of the player named {name}

From the first URL, I get each player's name and position; I can then call the second URL using that name. My current goal is to store all the data in a database.
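As a rough illustration of the second step, the profile URL can be built from a scraped name. This is only a sketch: the host `mywebsite.org` is a placeholder, the helper name `player_url` is hypothetical, and it assumes the site accepts percent-encoded names in the path.

```python
from urllib.parse import quote, urljoin

# Placeholder host; replace with the real site.
BASE_URL = "http://mywebsite.org/"


def player_url(name: str, base_url: str = BASE_URL) -> str:
    """Build the /player/{name} URL for a name scraped from /top."""
    # quote() percent-encodes spaces and other unsafe characters;
    # safe="" makes it encode "/" as well, so a name cannot add path segments.
    return urljoin(base_url, "/player/" + quote(name, safe=""))


print(player_url("John Doe"))  # http://mywebsite.org/player/John%20Doe
```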

The problem

I created two spiders: the first crawls /top, and the second crawls /player/{name} for each player the first spider has found. However, to insert the first spider's data into the database, I need to call the profile spider first, because the player is a foreign key (player_id), as shown in the following queries:

INSERT INTO top_players (player_id, position) values (1, 1)

INSERT INTO players (name) values ('John Doe')
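The ordering constraint behind these two queries can be reproduced with a small SQLite sketch. The schema below is an assumption (the question does not show the table definitions), as is the helper name `insert_ranking`; it only illustrates that the `players` row has to exist before the `top_players` row that references it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE top_players (
        player_id INTEGER NOT NULL REFERENCES players(id),
        position  INTEGER NOT NULL
    );
""")


def insert_ranking(conn, player_id, position):
    """Insert one row into top_players (hypothetical helper)."""
    conn.execute(
        "INSERT INTO top_players (player_id, position) VALUES (?, ?)",
        (player_id, position),
    )


# Inserting the ranking first fails: player 1 does not exist yet.
try:
    insert_ranking(conn, 1, 1)
    ranking_first_failed = False
except sqlite3.IntegrityError:
    ranking_first_failed = True

# Inserting the player first yields the id that the ranking row needs.
cur = conn.execute("INSERT INTO players (name) VALUES (?)", ("John Doe",))
insert_ranking(conn, cur.lastrowid, 1)
conn.commit()
```

This is why the position scraped from /top cannot be stored until the matching player row has been written.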

Question

Is it possible to execute a spider from the Pipeline just to get that spider's results? I mean, the called spider should not activate the pipeline again.

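I would suggest taking more control over the scraping process, especially when grabbing the name and position from the first page and the details from the detail page. With a single spider, the item can be passed from the first callback to the second, so the pipeline only ever receives complete items. Try this: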

import scrapy


class MyItem(scrapy.Item):
    name = scrapy.Field()
    position = scrapy.Field()
    detail = scrapy.Field()


class MySpider(scrapy.Spider):
    name = '<name of spider>'
    allowed_domains = ['mywebsite.org']
    start_urls = ['http://mywebsite.org/<path to the page>']

    def parse(self, response):
        # Select the links to the detail pages; replace the placeholder with a real pattern
        rows = response.xpath('//a[contains(@href,"<div id or class>")]')

        # Loop over all matching links
        for row in rows:
            myItem = MyItem()  # create a new item
            myItem['name'] = row.xpath('./text()').get()  # assign name from link text
            myItem['position'] = row.xpath('./text()').get()  # placeholder: point this XPath at the position
            detail_url = response.urljoin(row.xpath('./@href').get())  # build the absolute detail-page URL
            request = scrapy.Request(url=detail_url, callback=self.parse_detail)  # request the detail page
            request.meta['myItem'] = myItem  # pass the item along with the request
            yield request

    def parse_detail(self, response):
        myItem = response.meta['myItem']  # retrieve the item (with name and position) from the response
        text_raw = response.xpath('//font[@size=3]//text()').getall()  # extract the detail text nodes
        myItem['detail'] = ' '.join(map(str.strip, text_raw))  # clean up the text (str.strip replaces Python 2's unicode.strip)
        yield myItem  # hand the completed item to the pipeline
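Since each item now reaches the pipeline with name, position and detail already filled in, the pipeline can write both rows in the right order without calling another spider. The following is a minimal sketch, not a definitive implementation: the class name `PlayerPipeline`, the use of SQLite, and the table layout (including the extra `detail` column) are all assumptions. The `open_spider`, `process_item` and `close_spider` hooks are the standard Scrapy pipeline methods.

```python
import sqlite3


class PlayerPipeline:
    """Hypothetical item pipeline: stores one combined item in both tables."""

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path  # assumed SQLite database location
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the database, create the assumed schema.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.executescript("""
            CREATE TABLE IF NOT EXISTS players (
                id INTEGER PRIMARY KEY, name TEXT NOT NULL, detail TEXT);
            CREATE TABLE IF NOT EXISTS top_players (
                player_id INTEGER NOT NULL REFERENCES players(id),
                position  INTEGER NOT NULL);
        """)

    def process_item(self, item, spider):
        # Insert the player first so its generated id can serve as the foreign key.
        cur = self.conn.execute(
            "INSERT INTO players (name, detail) VALUES (?, ?)",
            (item["name"], item["detail"]),
        )
        self.conn.execute(
            "INSERT INTO top_players (player_id, position) VALUES (?, ?)",
            (cur.lastrowid, item["position"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()


# Stand-alone check without running a crawl; a plain dict stands in for the item.
pipeline = PlayerPipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({"name": "John Doe", "position": 1, "detail": "profile text"}, spider=None)
rows = pipeline.conn.execute(
    "SELECT p.name, t.position FROM top_players t JOIN players p ON p.id = t.player_id"
).fetchall()
```

In a real project the class would be enabled through the ITEM_PIPELINES setting, and Scrapy would call the three hooks itself.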
