
How to define start_urls to scrape articles with Scrapy?

I am trying to scrape articles from this website. The scheme I have so far is:

  • the start_urls cover all the pages of the website
  • the parse function captures the detail page, where all the information I need is
  • parse_item extracts the information I need (date, title, and full text)
  • I use response.xpath to get the information
import scrapy
from scrapy.crawler import CrawlerProcess

class weeklymining(scrapy.Spider):
    name = 'weeklymining'
    start_urls = ['https://www.miningweekly.com/page/coal/page:'+str(x) for x in range(1,4)]

    def parse(self, response):
        for link in response.xpath('//*[@class="en-serif"]/a/@href'):
            yield scrapy.Request(
                url=link.get(),
                callback=self.parse_item
            )

    def parse_item(self, response):
        yield {
            'date': response.xpath('//*[@class="sml"]/p/span[1]').get(),
            'category': 'coal',
            'title': response.xpath('//*[@class="article_title"]/h2/a/text()').get(),
            'text': ''.join([x.get().strip() for x in response.xpath('//*[@id="article_content_container"]//p//text()')])
            }

if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(weeklymining)
    process.start()

When I run the code, it gives me this error:

2022-07-14 23:14:28 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.miningweekly.com/page/coal/page:1> (referer: None)

I can open the URL in a browser, but it seems the code cannot process it.

How should I fix my code in order to get the required information from the website?

Thank you so much for any help.

The error is because the line url=link.get() produces a relative URL, which cannot be crawled. You can either join it with the base URL to form the full URL, or use the response.follow shortcut, as below:

import scrapy
from scrapy.crawler import CrawlerProcess


class weeklymining(scrapy.Spider):
    name = 'weeklymining'
    # listing pages 1-3 of the coal category
    start_urls = ['https://www.miningweekly.com/page/coal/page:' + str(x) for x in range(1, 4)]

    def parse(self, response):
        # follow every article link found on the listing page;
        # response.follow resolves relative hrefs against the current page URL
        for link in response.xpath('//*[@class="en-serif"]/a/@href'):
            yield response.follow(
                url=link.get(),
                callback=self.parse_item
            )

    def parse_item(self, response):
        yield {
            # note: this selects the whole <span> element; append /text()
            # if you only want the date string itself
            'date': response.xpath('//*[@class="sml"]/p/span[1]').get(),
            'category': 'coal',
            'title': response.xpath('//*[@class="article_title"]/h2/a/text()').get(),
            'text': ''.join([x.get().strip() for x in response.xpath('//*[@id="article_content_container"]//p//text()')])
        }


if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(weeklymining)
    process.start()
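
For completeness, the other option mentioned above, building the absolute URL yourself, can be done with response.urljoin, which resolves a relative href against the URL of the current response. Below is a minimal sketch of just the parse method using that approach; the rest of the spider stays the same:

    def parse(self, response):
        for link in response.xpath('//*[@class="en-serif"]/a/@href'):
            # urljoin turns the relative href into an absolute URL
            # based on the page the link was found on
            yield scrapy.Request(
                url=response.urljoin(link.get()),
                callback=self.parse_item
            )

Both approaches produce the same absolute request URLs; response.follow is simply the shorter form, since it accepts relative URLs (and even link selectors) directly. If you also want the scraped items written to a file, you can pass a FEEDS setting when creating the process, e.g. CrawlerProcess(settings={'FEEDS': {'items.json': {'format': 'json'}}}).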
