Python - Web scraping using Scrapy
I have just started learning web scraping with the Scrapy framework. I am trying to scrape drug reviews from a medical website using the code below. However, when I run "scrapy runspider spiders/medreview.py -o med.csv", I get output like "INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)" and med.csv contains no data.
# Importing Scrapy library
import scrapy

# Creating a new class to implement the spider
class MedSpider(scrapy.Spider):
    # Spider name
    name = 'reviews'
    # Domain names to scrape
    allowed_domains = ['1mg.com']
    # Base URL for the drug reviews
    myBaseUrl = "https://www.1mg.com/otc/becosules-z-capsule-otc63496/amp"

    # Defining a Scrapy parser
    def parse(self, response):
        data = response.css('.OtcPage__reviews-container___hrKgt')
        ##data = response.css('.ReviewCards__review-card___3Z733')
        # Collecting user reviews
        comments = data.css('.ReviewCards__review-description___WoLdZ')
        count = 0
        # Combining the results
        for review in comments:
            yield {'comment': ''.join(review.xpath('.//text()').extract())}
            count = count + 1
Following @stranac's comment, I added "start_urls = myBaseUrl". Now I am getting the following error in the console:
2020-09-28 16:04:34 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "E:\anaconda\lib\site-packages\scrapy\core\engine.py", line 129, in _next_request
    request = next(slot.start_requests)
  File "E:\anaconda\lib\site-packages\scrapy\spiders\__init__.py", line 77, in start_requests
    yield Request(url, dont_filter=True)
  File "E:\anaconda\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
    self._set_url(url)
  File "E:\anaconda\lib\site-packages\scrapy\http\request\__init__.py", line 69, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
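The traceback itself hints at the cause: start_urls is expected to be a list of URLs, and assigning a plain string makes Scrapy iterate it character by character, so the first "URL" it sees is just the letter "h". A minimal stdlib-only sketch (no Scrapy needed) of what goes wrong:

```python
# Illustration of the "Missing scheme in request url: h" error.
# Scrapy iterates start_urls; a bare string iterates per character.
myBaseUrl = "https://www.1mg.com/otc/becosules-z-capsule-otc63496/amp"

start_urls = myBaseUrl            # wrong: a bare string
print(next(iter(start_urls)))     # 'h' -- hence the error message

start_urls = [myBaseUrl]          # right: a one-element list
print(next(iter(start_urls)))     # the full URL
```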
You are doing a couple of things wrong. You are trying to scrape reviews from a page that does not exist; the reviews can be found here or here, so you need to use either of those suggested URLs. To access the data, you must also define headers in the request. Given that, the following is one way you can parse the data:
import scrapy

class MedSpider(scrapy.Spider):
    name = 'reviews'
    start_urls = [
        # "https://www.1mg.com/otc/becosules-z-capsule-otc63496"
        "https://www.1mg.com/otc/becosules-z-capsule-otc63496/reviews"
    ]
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"}

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, headers=self.headers)

    def parse(self, response):
        for review in response.css("[class^='ReviewCards__review-card']"):
            reviewer_name = review.css("[class^='ReviewCards__name']::text").get()
            reviewer_rating = review.css("[class^='Rating__ratings-container'] > span::text").get()
            print(reviewer_name, reviewer_rating)
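Since the original goal was to write the results to med.csv with the -o flag, a variant of the parse method above that yields dicts instead of printing would let Scrapy's feed exporter fill the CSV (the field names "name" and "rating" here are my own choice, not anything Scrapy requires):

```python
# Hypothetical yield-based variant of the answer's parse method:
# yielding dicts lets `scrapy runspider medreview.py -o med.csv`
# export the scraped fields to the CSV file.
def parse(self, response):
    for review in response.css("[class^='ReviewCards__review-card']"):
        yield {
            "name": review.css("[class^='ReviewCards__name']::text").get(),
            "rating": review.css(
                "[class^='Rating__ratings-container'] > span::text"
            ).get(),
        }
```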