
Why is xpath's extract() returning an empty list for the href attribute of an anchor element?

Why do I get an empty list when I use Scrapy to extract the href attribute of the anchor tag on the following URL: https://www.udemy.com/courses/search/?src=ukw&q=accounting ?

This is my code to extract the href of the <a> element located inside the div with the list-view-course-card--course-card-wrapper--TJ6ET class:

response.xpath("//div[@class='list-view-course-card--course-card-wrapper--TJ6ET']/a/@href").extract()

This site makes API calls to retrieve all of its data, so the HTML that Scrapy downloads does not contain the course cards your XPath targets. You can use the Scrapy shell to see the response that the site actually returns: run scrapy shell 'https://www.udemy.com/courses/search/?src=ukw&q=accounting' and then view(response).

The data you are looking for is available from the following API call: 'https://www.udemy.com/api-2.0/search-courses/?fields[locale]=simple_english_title&src=ukw&q=accounting'. However, if you try to access this link directly, you will get a JSON object saying that you do not have permission to perform this action. How did I find this link? Load the URL in your browser, open the Network tab of the developer tools, and look for XHR requests.
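Comparing the two URLs suggests that the API call reuses the src and q parameters of the search page. Here is a minimal sketch that builds the API URL for other search terms, assuming this pattern holds. The pattern is inferred from this single example only, and build_search_api_url is a hypothetical helper, not part of Scrapy or of any Udemy client.

```python
from urllib.parse import urlencode

# Endpoint taken from the API call quoted above.
API_BASE = 'https://www.udemy.com/api-2.0/search-courses/'


def build_search_api_url(query, src='ukw'):
    """Build the search API URL for a query (hypothetical helper)."""
    # ASSUMPTION: only the q parameter changes between searches.
    params = {
        'fields[locale]': 'simple_english_title',
        'src': src,
        'q': query,
    }
    # safe='[]' keeps the brackets in 'fields[locale]' unescaped,
    # matching the URL seen in the network tab.
    return API_BASE + '?' + urlencode(params, safe='[]')


print(build_search_api_url('accounting'))
```

For 'accounting' this reproduces the API URL quoted above. Other queries are only a guess until you confirm them in the network tab.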

The following spider first makes a request to the main search page and then makes a request to the API URL. You will have to parse the returned JSON object to obtain your data. If you want to scale this spider to more products, look for a pattern in the structure of the API call.

import json

import scrapy


class UdemySpider(scrapy.Spider):
    name = 'udemy'
    # API endpoint found in the browser's network tab (XHR requests)
    newurl = 'https://www.udemy.com/api-2.0/search-courses/?fields[locale]=simple_english_title&src=ukw&q=accounting'

    def start_requests(self):
        urls = ['https://www.udemy.com/courses/search/?src=ukw&q=accounting']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.api_call)

    def api_call(self, response):
        # The search page has been fetched; now request the API that serves the data
        print("Working on second page")
        yield scrapy.Request(url=self.newurl, callback=self.parse)

    def parse(self, response):
        # Decode the JSON body; which fields to extract depends on its structure
        data = json.loads(response.text)
        yield {'data': data}
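The JSON parsing step could look roughly like the sketch below. The answer does not show the layout of the response, so the "results" list and the per-course "url" key are assumptions. The sample payload is made up for illustration, and extract_course_urls is a hypothetical helper. Inspect the real response in the network tab and adjust the keys accordingly.

```python
import json
from urllib.parse import urljoin


def extract_course_urls(json_text, base_url='https://www.udemy.com'):
    """Return absolute course URLs from a JSON payload (assumed layout)."""
    data = json.loads(json_text)
    urls = []
    # ASSUMPTION: courses are listed under "results", each with a "url" key.
    for course in data.get('results', []):
        url = course.get('url')
        if url:
            # Turn a relative path such as "/course/..." into a full URL.
            urls.append(urljoin(base_url, url))
    return urls


# Made-up sample payload, only for illustration.
sample = '{"results": [{"url": "/course/example-accounting/"}, {"title": "no url"}]}'
print(extract_course_urls(sample))
```

Inside the spider, parse could call extract_course_urls(response.text) and yield one item per URL.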
