How to parse embedded links with a Python Scrapy spider

Question

I am trying to use Python's Scrapy to extract course catalog information from a website. Each course has a link to its full page, and I need to iterate through those pages one by one to extract their information, which is later fed into an SQL database. I don't know how to change the URLs in the spider successively. My code so far is attached below.

import scrapy

def find_between(s, first, last):
    try:
        start = s.index(first) + len(first)
        end = s.index(last, start)
        return s[start:end]
    except ValueError:
        return ""

class QuoteSpider(scrapy.Spider):
    name = 'courses'
    start_urls = [
        'http://catalog.aucegypt.edu/content.php?catoid=36&navoid=1738',

    ]

    def parse(self, response):
        # pages in span+ a
        all_courses = response.css('.width a')
        for course in all_courses:
            courseURL = course.xpath('@href').extract()
            cleanCourseURL = find_between(str(courseURL), "['", "']")
            fullURL = "http://catalog.aucegypt.edu/" + cleanCourseURL

            #iterate through urls
            QuoteSpider.start_urls += fullURL
            courseName = response.css('.block_content')


            yield {
                'courseNum': fullURL,
                'test': courseName
            }

Answer 1

Usually you need to yield a scrapy.Request for this new URL and process its response in a corresponding callback:

def parse(self, response):
    # pages in span+ a
    all_courses = response.css('.width a')
    for course in all_courses:
        courseURL = course.xpath('@href').extract()
        cleanCourseURL = find_between(str(courseURL), "['", "']")
        fullURL = "http://catalog.aucegypt.edu/" + cleanCourseURL
        courseName = response.css('.block_content')
        yield scrapy.Request(
            url=fullURL,
            callback=self.parse_course,
            cb_kwargs={
                'course_name': courseName,
            },
        )

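def parse_course(self, response, course_name):
    # Parse the course page here. As a placeholder, yield the page URL
    # together with the value passed in from parse() through cb_kwargs.
    yield {
        'courseNum': response.url,
        'test': course_name,
    }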


Solution 1
0 Accepted 2020-12-11 07:54:03
