The website I'm scraping has two URLs:
/top lists the top players
/player/{name} shows info for the player named {name}
From the first URL, I get the player name and position; then I can call the second URL using that name. My current goal is to store all the data in a database.
I created two spiders. The first crawls /top, and the second crawls /player/{name} for each player the first spider has found. However, to insert the first spider's data into the database, I need to call the profile spider first, because player_id is a foreign key, as shown in the following queries:
INSERT INTO top_players (player_id, position) values (1, 1)
INSERT INTO players (name) values ('John Doe')
Is it possible to execute a spider from the Pipeline just to get the spider results? I mean, the called spider should not activate the pipeline again.
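I would suggest taking more control over the scraping process, especially when grabbing the name and position from the first page and the rest from the detail page. A single spider can follow each link to the detail page and carry the partially filled item along with the request, so the pipeline only ever sees complete items. Try this: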
# -*- coding: utf-8 -*-
import scrapy


class MyItem(scrapy.Item):
    name = scrapy.Field()
    position = scrapy.Field()
    detail = scrapy.Field()


class MySpider(scrapy.Spider):
    name = '<name of spider>'
    allowed_domains = ['mywebsite.org']
    start_urls = ['http://mywebsite.org/<path to the page>']

    def parse(self, response):
        rows = response.xpath('//a[contains(@href,"<div id or class>")]')
        # loop over all links to detail pages
        for row in rows:
            myItem = MyItem()  # create a new item
            myItem['name'] = row.xpath('./text()').extract_first()  # assign name from link
            # placeholder: adjust this XPath so it selects the position, not the name
            myItem['position'] = row.xpath('./text()').extract_first()
            detail_url = response.urljoin(row.xpath('./@href').extract_first())  # extract url from link
            request = scrapy.Request(url=detail_url, callback=self.parse_detail)  # request the detail page
            request.meta['myItem'] = myItem  # pass the item with the request
            yield request

    def parse_detail(self, response):
        myItem = response.meta['myItem']  # get the item (with the name) back from the response
        text_raw = response.xpath('//font[@size=3]//text()').extract()  # extract the detail (text)
        # clean up the text and assign to item; s.strip() works on Python 2 and 3, unlike unicode.strip
        myItem['detail'] = ' '.join(s.strip() for s in text_raw)
        yield myItem  # return the complete item
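Since each item now reaches the pipeline with both the list-page and detail-page data, the foreign key can be resolved in a single pipeline step: insert the player first, then use its id for the top_players row. Below is a minimal sketch of such a pipeline, assuming SQLite via the standard sqlite3 module. The class name PlayerPipeline, the id and detail columns, and the in-memory database are assumptions made for illustration; only the table names and the name, player_id and position columns come from the question.

```python
import sqlite3


class PlayerPipeline:
    """Hypothetical item pipeline: writes one item into two related tables."""

    def __init__(self, db_path=":memory:"):
        # db_path is an assumption; point it at a real file in practice
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        self.conn = sqlite3.connect(self.db_path)
        # Assumed schema: players.id is the key that top_players.player_id references
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS players ("
            "id INTEGER PRIMARY KEY, name TEXT UNIQUE, detail TEXT)"
        )
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS top_players ("
            "player_id INTEGER REFERENCES players(id), position INTEGER)"
        )

    def process_item(self, item, spider):
        cur = self.conn.cursor()
        # Insert the player first (skipped if the name is already stored)
        cur.execute(
            "INSERT OR IGNORE INTO players (name, detail) VALUES (?, ?)",
            (item["name"], item.get("detail")),
        )
        # Look up the id so it can be used as the foreign key
        cur.execute("SELECT id FROM players WHERE name = ?", (item["name"],))
        player_id = cur.fetchone()[0]
        cur.execute(
            "INSERT INTO top_players (player_id, position) VALUES (?, ?)",
            (player_id, item["position"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```

To use something like this, the class would be listed under ITEM_PIPELINES in the project settings. No second spider has to be started from the pipeline, because the detail data is already part of the item.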