I am trying to scrape articles from this website. So far, the code I have written is:
import scrapy
from scrapy.crawler import CrawlerProcess

class weeklymining(scrapy.Spider):
    name = 'weeklymining'
    start_urls = ['https://www.miningweekly.com/page/coal/page:' + str(x) for x in range(1, 4)]

    def parse(self, response):
        for link in response.xpath('//*[@class="en-serif"]/a/@href'):
            yield scrapy.Request(
                url=link.get(),
                callback=self.parse_item
            )

    def parse_item(self, response):
        yield {
            'date': response.xpath('//*[@class="sml"]/p/span[1]').get(),
            'category': 'coal',
            'title': response.xpath('//*[@class="article_title"]/h2/a/text()').get(),
            'text': ''.join([x.get().strip() for x in response.xpath('//*[@id="article_content_container"]//p//text()')])
        }

if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(weeklymining)
    process.start()
When I run the code, it gives me this error:

2022-07-14 23:14:28 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.miningweekly.com/page/coal/page:1> (referer: None)

I can open the URL in a browser, but the code cannot process it. How should I fix my code to get the required information from the website? Thank you for any help.
The error occurs because the line url=link.get() produces a relative URL, which cannot be crawled directly. You can either join it with the base URL to form an absolute URL, or use the response.follow shortcut, which resolves relative URLs for you:
import scrapy
from scrapy.crawler import CrawlerProcess

class weeklymining(scrapy.Spider):
    name = 'weeklymining'
    start_urls = ['https://www.miningweekly.com/page/coal/page:' + str(x) for x in range(1, 4)]

    def parse(self, response):
        for link in response.xpath('//*[@class="en-serif"]/a/@href'):
            # response.follow accepts a relative href and resolves it
            # against the current page's URL before making the request
            yield response.follow(
                url=link.get(),
                callback=self.parse_item
            )

    def parse_item(self, response):
        yield {
            'date': response.xpath('//*[@class="sml"]/p/span[1]').get(),
            'category': 'coal',
            'title': response.xpath('//*[@class="article_title"]/h2/a/text()').get(),
            'text': ''.join([x.get().strip() for x in response.xpath('//*[@id="article_content_container"]//p//text()')])
        }

if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(weeklymining)
    process.start()
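If you prefer the other option, building the absolute URL yourself, Scrapy's response.urljoin(href) resolves a relative href against the current page's URL. Under the hood this is standard URL joining, which the following standalone sketch illustrates (the example href is made up for illustration; real hrefs from the site will differ):

```python
from urllib.parse import urljoin

# URL of the listing page being parsed
base = 'https://www.miningweekly.com/page/coal/page:1'

# A relative href as it might appear in the page's HTML (illustrative value)
relative_href = '/article/some-coal-story'

# Join the relative href against the base URL, as response.urljoin does
absolute = urljoin(base, relative_href)
print(absolute)  # → https://www.miningweekly.com/article/some-coal-story
```

Inside the spider that would look like yield scrapy.Request(url=response.urljoin(link.get()), callback=self.parse_item), but response.follow is the shorter and more idiomatic choice.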