Scrapy Spider class function call

I am trying to scrape a university website to get the information about all of its courses. But in my spider the parse_course method doesn't seem to be working: it doesn't yield or print anything.

import scrapy
from ..items import UniversityItem

class DuneSpider(scrapy.Spider):
    name = 'Dune'
    allowed_domains = ['https://www.dundee.ac.uk/']
    start_urls = ['https://www.dundee.ac.uk/undergraduate/courses']

    def parse(self, response):
        courses = response.css(".filterable-list a::attr(href)").extract()
        courses_length = len(courses)

        for course in range(courses_length):
            courses[course] = "https://www.dundee.ac.uk" + courses[course]

        print("THE COURSE LINK:\n", courses[1:10])

        for course_url in courses:
            print("COURSE URL:", course_url)
            yield scrapy.Request(course_url, callback=self.parse_course)

    def parse_course(self, response):
        print("IN PARSE COURSE: ", response.url)
        item = UniversityItem()
        course_name = response.xpath("//h1[@class='hero__title']/text()").extract()
        item['course_name'] = course_name
        print(course_name)
        yield item['course_name']

Change allowed_domains so that it contains a domain name, not a URL:

allowed_domains = ['www.dundee.ac.uk']

You also have to yield the item itself instead of the list stored in one of its fields. Change:

 yield item['course_name']

to:

 yield item

print writes to standard output, which is not captured in the Scrapy log by default. You can enable that by setting LOG_STDOUT = True in settings.py.

A better solution is to use Spider.logger:

class DuneSpider(scrapy.Spider):
    ...

    def parse_course(self, response):
        self.logger.info("IN PARSE COURSE: %s", response.url)
        ...
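
Spider.logger behaves like a standard Python logger, so the message is a %-style format string and the extra arguments fill its placeholders. A minimal sketch using only the standard library; the logger name "Dune", the in-memory stream and the format string are just for illustration, since Scrapy sets up its own handlers:

```python
import io
import logging

# Capture log output in memory so it can be inspected afterwards.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("[%(name)s] %(levelname)s: %(message)s"))

# Stand-in for Spider.logger; in a real spider you would use self.logger.
logger = logging.getLogger("Dune")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

url = "https://www.dundee.ac.uk/undergraduate/accountancy-mathematics"

# The %s placeholder is filled with url only when the record is emitted.
logger.info("IN PARSE COURSE: %s", url)

output = stream.getvalue()
print(output)
```

Without a placeholder in the message, the extra argument has nowhere to go and the logging module reports a formatting error instead of the message.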

Update: I missed something. @Roman is right: to process the item in an ItemPipeline, you should yield the Item instance itself, not one of its attributes.

The output of the scrapy command shows you what's wrong in multiple places:

2020-09-15 07:43:23 [py.warnings] WARNING: c:\program files\python37\lib\site-packages\scrapy\spidermiddlewares\offsite.py:61: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry https://www.dundee.ac.uk/ in allowed_domains.
  warnings.warn(message, URLWarning)

2020-09-15 07:43:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-09-15 07:43:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.dundee.ac.uk/undergraduate/courses> (referer: None)
2020-09-15 07:43:24 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.dundee.ac.uk': <GET https://www.dundee.ac.uk/undergraduate/accountancy-mathematics>

'offsite/domains': 1,
'offsite/filtered': 229,

Your links are detected as offsite because your allowed_domains is wrong. As the name suggests, it should be a list of domain names, not a list of URLs.

Changing allowed_domains to ['www.dundee.ac.uk'] fixes the problem.
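
To see why a URL in allowed_domains can never match, here is a simplified sketch of the kind of check involved. It is not Scrapy's actual implementation, and the helper name is_offsite is made up for this example; it only compares the request's host name with each allowed entry, accepting subdomains too:

```python
from urllib.parse import urlparse


def is_offsite(url, allowed_domains):
    """Return True if the host of url matches none of the allowed domains.

    Hypothetical helper: a simplified illustration, not Scrapy's own code.
    """
    host = urlparse(url).hostname or ""
    return not any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


course_url = "https://www.dundee.ac.uk/undergraduate/accountancy-mathematics"

# A full URL is never equal to a host name, so every request looks offsite.
print(is_offsite(course_url, ["https://www.dundee.ac.uk/"]))

# A bare domain name matches the request's host.
print(is_offsite(course_url, ["www.dundee.ac.uk"]))
```

The host of every course link is www.dundee.ac.uk, which can only equal an entry that is itself a bare domain name.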
