
Scrapy spider outputs empty csv file

This is my first question here and I'm learning how to code by myself, so please bear with me.

I'm working on a final CS50 project where I'm trying to build a website that aggregates online Spanish courses from edx.org and possibly other open online course websites. I'm using the Scrapy framework to scrape the filtered results for Spanish courses on edx.org. Here is my first Scrapy spider, in which I'm trying to follow each course link and then get its name (once I get the code right, I'll also get the description, course URL and more).

from scrapy.item import Field, Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractor import LinkExtractor
from scrapy.loader import ItemLoader

class Course_item(Item):
    name = Field()
    #description = Field()
    #img_url = Field()


class Course_spider(CrawlSpider):
    name = 'CourseSpider'
    allowed_domains = ['https://www.edx.org/']
    start_urls = ['https://www.edx.org/course/?language=Spanish']

    rules = (Rule(LinkExtractor(allow=r'/course'), callback='parse_item', follow='True'),)

    def parse_item(self, response):
        item = ItemLoader(Course_item, response)
        item.add_xpath('name', '//*[@id="course-intro-heading"]/text()')

        yield item.load_item()

When I run the spider with "scrapy runspider edxSpider.py -o edx.csv -t csv" I get an empty csv file, and I also think it's not getting into the right Spanish course results.

Basically I want to get into each course from this link (edx Spanish courses) and get the name, description, provider, page URL and image URL.

Any ideas on what might be the problem?

You can't get edx content with a simple request; the site uses JavaScript rendering to load the course elements dynamically, so CrawlSpider won't work in this case, because you need to find specific elements inside the response body to generate a new Request that will get what you need.

The real request (to get the URLs of the courses) is this one, but you need to generate it from the previous response body (although you could also just visit it directly and get the correct data).

So, to generate the real request, you need data that is inside a script tag:

from scrapy import Spider
import re
import json

class Course_spider(Spider):
    name = 'CourseSpider'
    allowed_domains = ['edx.org']
    start_urls = ['https://www.edx.org/course/?language=Spanish']

    def parse(self, response):
        # grab the <script> block that defines Drupal.settings
        script_text = response.xpath('//script[contains(text(), "Drupal.settings")]').extract_first()
        # pull out the JSON object passed to Drupal.settings and parse it
        parseable_json_data = re.search(r'Drupal.settings, ({.+})', script_text).group(1)
        json_data = json.loads(parseable_json_data)
        ...

Now you have what you need in json_data and only need to build the URL string.
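A minimal sketch of how that continuation might look, assuming the parsed settings expose the catalog request URL under some key; the key name course_search_url and the parse_courses callback below are hypothetical placeholders for illustration, not edx's real field names:

from scrapy import Spider, Request
import re
import json


class Course_spider(Spider):
    name = 'CourseSpider'
    allowed_domains = ['edx.org']
    start_urls = ['https://www.edx.org/course/?language=Spanish']

    def parse(self, response):
        # same extraction as above: find the Drupal.settings JSON in the page
        script_text = response.xpath('//script[contains(text(), "Drupal.settings")]').extract_first()
        match = re.search(r'Drupal.settings, ({.+})', script_text or '')
        if not match:
            return
        json_data = json.loads(match.group(1))

        # "course_search_url" is a hypothetical key name used only for
        # illustration; inspect json_data to see where the real catalog
        # request URL (or the pieces needed to build it) actually lives.
        api_url = json_data.get('course_search_url')
        if api_url:
            yield Request(api_url, callback=self.parse_courses)

    def parse_courses(self, response):
        # the catalog endpoint returns JSON; parse it and pick out the fields you need
        data = json.loads(response.text)
        self.logger.info('catalog keys: %s', list(data))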

This page uses JavaScript to get data from the server and add it to the page.

It uses URLs like

https://www.edx.org/api/catalog/v2/courses/course-v1:IDBx+IDB33x+3T2017

The last part is the course id, which you can find in the HTML:

<main id="course-info-page" data-course-id="course-v1:IDBx+IDB33x+3T2017">

Code

from scrapy.http import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import json

class Course_spider(CrawlSpider):

    name = 'CourseSpider'
    allowed_domains = ['www.edx.org']
    start_urls = ['https://www.edx.org/course/?language=Spanish']

    rules = (Rule(LinkExtractor(allow=r'/course'), callback='parse_item', follow=True),)

    def parse_item(self, response):
        print('parse_item url:', response.url)

        # read the course id that the page embeds in the HTML
        course_id = response.xpath('//*[@id="course-info-page"]/@data-course-id').extract_first()

        if course_id:
            # ask the JSON API for this course's data
            url = 'https://www.edx.org/api/catalog/v2/courses/' + course_id
            yield Request(url, callback=self.parse_json)

    def parse_json(self, response):
        print('parse_json url:', response.url)

        # the whole JSON response becomes the item (a dict)
        item = json.loads(response.body)

        return item

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'FEED_FORMAT': 'csv',     # csv, json, xml
    'FEED_URI': 'output.csv',
})
c.crawl(Course_spider)
c.start()
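If you want specific columns in the CSV (name, description, image URL) instead of the raw JSON dump, parse_json can pick out individual fields. A minimal sketch of a drop-in replacement: 'title' does appear in this API's responses (the spider below reads it), while 'short_description' and 'image' are guesses that have to be checked against the actual JSON:

    def parse_json(self, response):
        # drop-in replacement for parse_json in the spider above
        data = json.loads(response.body)

        # 'title' appears in this API's responses (the spider below reads it);
        # 'short_description' and 'image' are assumed key names and must be
        # verified against the real JSON before relying on them.
        yield {
            'name': data.get('title'),
            'description': data.get('short_description'),
            'img_url': data.get('image'),
        }

If you also need the original course page URL, pass it along in the Request's meta when yielding from parse_item.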
from scrapy.http import Request
from scrapy import Spider
import json


class edx_scraper(Spider):

    name = "edxScraper"
    start_urls = [
        'https://www.edx.org/api/v1/catalog/search?selected_facets[]=content_type_exact%3Acourserun&selected_facets[]=language_exact%3ASpanish&page=1&page_size=9&partner=edx&hidden=0&content_type[]=courserun&content_type[]=program&featured_course_ids=course-v1%3AHarvardX+CS50B+Business%2Ccourse-v1%3AMicrosoft+DAT206x+1T2018%2Ccourse-v1%3ALinuxFoundationX+LFS171x+3T2017%2Ccourse-v1%3AHarvardX+HDS2825x+1T2018%2Ccourse-v1%3AMITx+6.00.1x+2T2017_2%2Ccourse-v1%3AWageningenX+NUTR101x+1T2018&featured_programs_uuids=452d5bbb-00a4-4cc9-99d7-d7dd43c2bece%2Cbef7201a-6f97-40ad-ad17-d5ea8be1eec8%2C9b729425-b524-4344-baaa-107abdee62c6%2Cfb8c5b14-f8d2-4ae1-a3ec-c7d4d6363e26%2Ca9cbdeb6-5fc0-44ef-97f7-9ed605a149db%2Cf977e7e8-6376-400f-aec6-84dcdb7e9c73'
    ]

    def parse(self, response):
        data = json.loads(response.text)
        # request the detail API for every course in the current result page
        for course in data['objects']['results']:
            url = 'https://www.edx.org/api/catalog/v2/courses/' + course['key']
            yield response.follow(url, self.course_parse)

        # follow pagination until there is no next page
        if data['objects'].get('next') is not None:
            yield response.follow(data['objects']['next'], self.parse)

    def course_parse(self, response):
        course = json.loads(response.text)
        yield {
            'name': course['title'],
            'effort': course['effort'],
        }
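As with the first spider, this one can be run with something like "scrapy runspider edxSpider.py -o edx.csv -t csv" (or plugged into the CrawlerProcess snippet above); each yielded dict becomes one row of the CSV.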
