
How to define start_urls to scrape articles with Scrapy?

I am trying to scrape articles from this website. The scheme I have so far is:

  • the start_urls cover all the pages of the website
  • the parse function captures the detail page, where all the information I need is
  • parse_item extracts the information I need (date, title, and full text)
  • I use response.xpath to get the information
import scrapy
from scrapy.crawler import CrawlerProcess

class weeklymining(scrapy.Spider):
    name = 'weeklymining'
    start_urls = ['https://www.miningweekly.com/page/coal/page:'+str(x) for x in range(1,4)]

    def parse(self, response):
        for link in response.xpath('//*[@class="en-serif"]/a/@href'):
            yield scrapy.Request(
                url=link.get(),
                callback=self.parse_item
            )

    def parse_item(self, response):
        yield {
            'date': response.xpath('//*[@class="sml"]/p/span[1]').get(),
            'category': 'coal',
            'title': response.xpath('//*[@class="article_title"]/h2/a/text()').get(),
            'text': ''.join([x.get().strip() for x in response.xpath('//*[@id="article_content_container"]//p//text()')])
            }

if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(weeklymining)
    process.start()

When I run the code, it gives me this error:

2022-07-14 23:14:28 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.miningweekly.com/page/coal/page:1> (referer: None)

I can open the URL in a browser, but it seems the code cannot process it.

How should I fix my code in order to get the required information from the website?

Thank you so much for any help.

The error is because the line url=link.get() produces a relative URL, which cannot be crawled. You can either join it with the base URL to form the full URL, or use the response.follow shortcut, as below:

import scrapy
from scrapy.crawler import CrawlerProcess


class weeklymining(scrapy.Spider):
    name = 'weeklymining'
    # listing pages 1-3 of the coal category
    start_urls = ['https://www.miningweekly.com/page/coal/page:' + str(x) for x in range(1, 4)]

    def parse(self, response):
        # follow every article link found on the listing page;
        # response.follow resolves relative hrefs against the current page URL
        for link in response.xpath('//*[@class="en-serif"]/a/@href'):
            yield response.follow(
                url=link.get(),
                callback=self.parse_item
            )

    def parse_item(self, response):
        yield {
            # note: this selects the whole <span> element; append /text()
            # if you only want the date string itself
            'date': response.xpath('//*[@class="sml"]/p/span[1]').get(),
            'category': 'coal',
            'title': response.xpath('//*[@class="article_title"]/h2/a/text()').get(),
            'text': ''.join([x.get().strip() for x in response.xpath('//*[@id="article_content_container"]//p//text()')])
        }


if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(weeklymining)
    process.start()
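
For completeness, the other option mentioned above, building the absolute URL yourself, can be done with response.urljoin, which resolves a relative href against the URL of the current response. Below is a minimal sketch of just the parse method using that approach; the rest of the spider stays the same:

    def parse(self, response):
        for link in response.xpath('//*[@class="en-serif"]/a/@href'):
            # urljoin turns the relative href into an absolute URL
            # based on the page the link was found on
            yield scrapy.Request(
                url=response.urljoin(link.get()),
                callback=self.parse_item
            )

Both approaches produce the same absolute request URLs; response.follow is simply the shorter form, since it accepts relative URLs (and even link selectors) directly. If you also want the scraped items written to a file, you can pass a FEEDS setting when creating the process, e.g. CrawlerProcess(settings={'FEEDS': {'items.json': {'format': 'json'}}}).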
