Spider not following links - scrapy

I am trying to build a spider that goes through 3 pages before getting to the page it scrapes. I have tested the responses in the shell, however, together they don't seem to work and I am not sure why.

My code is below:
# -*- coding: utf-8 -*-
import scrapy

class CollegiateSpider(scrapy.Spider):
    name = 'Collegiate'
    allowed_domains = ['collegiate-ac.com/uk-student-accommodation']
    start_urls = ['http://collegiate-ac.com/uk-student-accommodation/']

    # Step 1 - Get the area links
    def parse(self, response):
        for city in response.xpath('//*[@id="top"]/div[1]/div/div[1]/div/ul/li/a/text').extract():
            yield scrapy.Request(response.urljoin("/" + city), callback=self.parse_area_page)

    # Step 2 - Get the block links
    def parse_area_page(self, response):
        for url in response.xpath('//div[3]/div/div/div/a/@href').extract():
            yield scrapy.Request(response.urljoin(url), callback=self.parse_unitpage)

    # Step 3 - Get the room links
    def parse_unitpage(self, response):
        for url in response.xpath('//*[@id="subnav"]/div/div[2]/ul/li[5]/a/@href').extract():
            yield scrapy.Request(response.urljoin(final), callback=self.parse_final)

    # Step 4 - Scrape the data
    def parse_final(self, response):
        pass
I tried changing to CrawlSpider based on this answer, but that did not seem to help.

I am currently looking into how to debug spiders, but I am struggling with it, so I thought it would be beneficial to get opinions here as well.
You forgot `()` in `text()` in `'//*[@id="top"]/div[1]/div/div[1]/div/ul/li/a/text()'`. However, instead of `text()` I would use `@href` to get the url.
`urljoin('/' + city)` created wrong urls, because the leading `/` skips `/uk-student-accommodation` - you have to use `urljoin(city)`.
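`response.urljoin()` follows the standard `urllib.parse.urljoin` rules, so the effect of the leading slash can be shown with the stdlib alone (`'london'` is a hypothetical city slug):

```python
from urllib.parse import urljoin

base = 'http://collegiate-ac.com/uk-student-accommodation/'

# A leading slash makes the path absolute, dropping /uk-student-accommodation/
print(urljoin(base, '/london'))
# http://collegiate-ac.com/london

# A relative path is resolved against the base directory - this is what we want
print(urljoin(base, 'london'))
# http://collegiate-ac.com/uk-student-accommodation/london
```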
There was a problem with `allowed_domains` - it was blocking most of the urls.
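As far as I know, Scrapy's offsite filtering compares `allowed_domains` entries against the request's host name only, so an entry that contains a path like `'collegiate-ac.com/uk-student-accommodation'` can never match anything. A stdlib sketch of the comparison:

```python
from urllib.parse import urlparse

# The offsite filter matches allowed_domains entries against the host name,
# so an entry containing a path can never match and the requests get dropped.
host = urlparse('https://collegiate-ac.com/uk-student-accommodation/london/').netloc
print(host)                                                  # collegiate-ac.com
print(host == 'collegiate-ac.com/uk-student-accommodation')  # False
print(host == 'collegiate-ac.com')                           # True
```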
Working example. You can run it without a project, and it saves the final urls in output.csv

import scrapy

class CollegiateSpider(scrapy.Spider):

    name = 'Collegiate'
    allowed_domains = ['collegiate-ac.com']
    start_urls = ['https://collegiate-ac.com/uk-student-accommodation/']

    # Step 1 - Get the area links
    def parse(self, response):
        for url in response.xpath('//*[@id="top"]/div[1]/div/div[1]/div/ul/li/a/@href').extract():
            url = response.urljoin(url)
            #print('>>>', url)
            yield scrapy.Request(url, callback=self.parse_area_page)

    # Step 2 - Get the block links
    def parse_area_page(self, response):
        for url in response.xpath('//div[3]/div/div/div/a/@href').extract():
            url = response.urljoin(url)
            yield scrapy.Request(url, callback=self.parse_unitpage)

    # Step 3 - Get the room links
    def parse_unitpage(self, response):
        for url in response.xpath('//*[@id="subnav"]/div/div[2]/ul/li[5]/a/@href').extract():
            url = response.urljoin(url)
            yield scrapy.Request(url, callback=self.parse_final)

    # Step 4 - Scrape the data
    def parse_final(self, response):
        # show some information for test
        print('>>> parse_final:', response.url)
        # send url as item so it can be saved in the file
        yield {'final_url': response.url}

# --- run it without a project ---

import scrapy.crawler

c = scrapy.crawler.CrawlerProcess({
    "FEED_FORMAT": 'csv',
    "FEED_URI": 'output.csv'
})
c.crawl(CollegiateSpider)
c.start()