I am trying to scrape a university website to collect information about all of its courses. But in my spider the parse_course method doesn't seem to be working: it doesn't yield or print anything.
import scrapy
from ..items import UniversityItem


class DuneSpider(scrapy.Spider):
    name = 'Dune'
    allowed_domains = ['https://www.dundee.ac.uk/']
    start_urls = ['https://www.dundee.ac.uk/undergraduate/courses']

    def parse(self, response):
        courses = response.css(".filterable-list a::attr(href)").extract()
        courses_length = len(courses)
        for course in range(courses_length):
            courses[course] = "https://www.dundee.ac.uk" + courses[course]
        print("THE COURSE LINK:\n", courses[1:10])
        for course_url in courses:
            print("COURSE URL:", course_url)
            yield scrapy.Request(course_url, callback=self.parse_course)

    def parse_course(self, response):
        print("IN PARSE COURSE: ", response.url)
        item = UniversityItem()
        course_name = response.xpath("//h1[@class='hero__title']/text()").extract()
        item['course_name'] = course_name
        print(course_name)
        yield item['course_name']
Change allowed_domains to a bare domain name:
allowed_domains = ['www.dundee.ac.uk']
You also have to yield the item itself instead of the list stored in one of its fields, so change:
yield item['course_name']
to:
yield item
print writes to standard output, which is not captured in the Scrapy log by default. You can enable this by setting LOG_STDOUT = True in settings.py.
A better solution is to use Spider.logger:
class DuneSpider(scrapy.Spider):
    ...
    def parse_course(self, response):
        self.logger.info("IN PARSE COURSE: %s", response.url)
        ...
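Logger methods take a format string followed by its arguments, so the message needs a %s placeholder for the URL to appear in the output. Here is a small sketch using only the standard logging module, which Spider.logger is built on as far as I know. The ListHandler class and the example URL are just for illustration.

```python
import logging


class ListHandler(logging.Handler):
    """Collects formatted messages in a list so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        # getMessage() applies the arguments to the format string.
        self.messages.append(record.getMessage())


logger = logging.getLogger("Dune")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the example output out of the root logger
handler = ListHandler()
logger.addHandler(handler)

url = "https://www.dundee.ac.uk/undergraduate/accountancy-mathematics"

# The URL is substituted for the %s placeholder when the record is emitted.
logger.info("IN PARSE COURSE: %s", url)

print(handler.messages[0])
```

Passing the URL as an argument instead of concatenating it means the string is only built if the message is actually emitted.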
Update: I missed something. @Roman is right: for the item to be processed by an item pipeline, you should yield the Item instance itself, not one of its attributes.
The output of the scrapy command shows you what's wrong in multiple places:
2020-09-15 07:43:23 [py.warnings] WARNING: c:\program files\python37\lib\site-packages\scrapy\spidermiddlewares\offsite.py:61: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry https://www.dundee.ac.uk/ in allowed_domains.
warnings.warn(message, URLWarning)
2020-09-15 07:43:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-09-15 07:43:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.dundee.ac.uk/undergraduate/courses> (referer: None)
2020-09-15 07:43:24 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.dundee.ac.uk': <GET https://www.dundee.ac.uk/undergraduate/accountancy-mathematics>
'offsite/domains': 1,
'offsite/filtered': 229,
Your links are detected as offsite because your allowed_domains is wrong. As the name suggests, it should be a list of domain names, not a list of URLs.
Changing allowed_domains to ['www.dundee.ac.uk'] fixes the problem.
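To see why a URL entry can never match, it may help to compare hostnames directly. The following is a simplified illustration using only the standard library. It is not Scrapy's actual offsite filtering code, and the helper name is_allowed is hypothetical.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host equals, or is a subdomain of, an allowed domain."""
    host = urlparse(url).hostname or ""
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


course = "https://www.dundee.ac.uk/undergraduate/accountancy-mathematics"

# A full URL is never equal to a hostname, so nothing matches.
with_url = is_allowed(course, ["https://www.dundee.ac.uk/"])

# A bare domain name matches the request's hostname.
with_domain = is_allowed(course, ["www.dundee.ac.uk"])

# If you only have a URL, urlparse can give you the domain to use.
domain = urlparse("https://www.dundee.ac.uk/").hostname

print(with_url, with_domain, domain)
```

Here with_url is False and with_domain is True, which mirrors the filtered requests in the log above.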