How can I navigate and extract restaurant details from zaubee.com with Scrapy when the href attribute is set to "#" for each restaurant link?
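I'm currently working on a web scraping project that collects business information from zaubee.com. However, the href attribute of every restaurant link is set to #, which prevents me from reaching the individual restaurant pages and collecting the data I need.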
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class zaubeeSpider(scrapy.Spider):
    name = 'zaubeeerestaurant'
    allowed_domains = ['www.zaubee.com']
    start_urls = ['https://zaubee.com/category/restaurant-in-fredonia-hclq6jom']

    def parse(self, response):
        restaurantlink = response.xpath("//div[@class='search-result__title-wrapper']/h2")
        for restaurant in restaurantlink:
            name = restaurant.xpath(".//text()").get()
            link = restaurant.xpath(".//@href").get()
            yield {
                'name': name,
                'link': link
            }
            yield response.follow(url=link, callback=self.parse_restaurant)

    def parse_restaurant(self, response):
        name = response.xpath("//h1[@class='postcard__title postcard__title--claimed']/text()").get()
        website = response.xpath("(//a[@class='profile__website__link']/@href)[1]").get()
        address = response.xpath("(//address[@class='profile__address--compact']/text())[1]").get()
        yield {
            'name': name,
            "website": website,
            'address': address
        }
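I have built a scraping solution with Scrapy, but I need help overcoming this problem. What approach or workaround can I use to reach each restaurant's page and get the necessary information?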
Output for one entry:
2023-06-04 23:38:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://zaubee.com/category/restaurant-in-fredonia-hclq6jom>
{'name': 'Restaurants in Fredonia New York', 'link': '#'}
When it tries to follow the inner link, the output looks like this:
2023-06-04 23:38:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://zaubee.com/category/restaurant-in-fredonia-hclq6jom>
{'name': None, 'website': None, 'address': None}
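I'm trying to follow each restaurant link and collect the restaurant's name, address, phone number, and opening hours from that page.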
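Your XPath selectors are simply wrong. As the output in the question shows, the h2 selector matches the listing's heading ('Restaurants in Fredonia New York') rather than a restaurant entry, so the only href it finds is "#".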
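To see why the selector matters, here is a minimal sketch that runs without Scrapy or network access. It uses `xml.etree.ElementTree` from the standard library as a stand-in for Scrapy's selectors, and the `SAMPLE` markup, the `/business/...` paths, and the `first_href` helper are all hypothetical; the real zaubee.com page structure may differ.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag, urljoin

BASE_URL = "https://zaubee.com/category/restaurant-in-fredonia-hclq6jom"

# Hypothetical, simplified markup (an assumption, not the site's real HTML).
SAMPLE = """
<html><body>
  <div class="search-result__title-wrapper">
    <h2>Restaurants in Fredonia New York<a href="#">Top</a></h2>
  </div>
  <div data-value="1">
    <a href="/business/example-diner-abc123"><h3>Example Diner</h3></a>
  </div>
  <div data-value="2">
    <a href="/business/sample-cafe-def456"><h3>Sample Cafe</h3></a>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE)


def first_href(element):
    """Return the first href on the element or its descendants (like .//@href)."""
    for node in element.iter():
        href = node.get("href")
        if href is not None:
            return href
    return None


# Selector from the question: matches only the page heading.
headings = root.findall(".//div[@class='search-result__title-wrapper']/h2")
old_links = [first_href(h2) for h2 in headings]

# Selector from the answer: one node per result card.
cards = root.findall(".//div[@data-value]")
new_links = [urljoin(BASE_URL, first_href(card)) for card in cards]

# Resolving "#" against the listing URL points back at the listing itself.
resolved_old = urldefrag(urljoin(BASE_URL, old_links[0])).url

print(old_links)
print(resolved_old)
print(new_links)
```

Under these assumptions, following "#" simply requests the listing page again, which would explain why the second log entry in the question was scraped from the category URL and every field came back as None.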
import re
import unicodedata

import scrapy


class zaubeeSpider(scrapy.Spider):
    name = 'zaubeeerestaurant'
    start_urls = ['https://zaubee.com/category/restaurant-in-fredonia-hclq6jom']
    allowed_domains = ['zaubee.com']

    def parse(self, response):
        # Select every div with a data-value attribute (the result cards).
        restaurants = response.xpath('//div[@data-value]')
        for restaurant in restaurants:
            name = restaurant.xpath('.//h3/text()[not(span)]').getall()
            name = ''.join(name).strip()
            link = restaurant.xpath(".//a/@href").get(default='')
            yield {
                'name': name,
                'link': response.urljoin(link)
            }
            # Follow only cards that actually carry a link.
            if link:
                yield response.follow(url=link, callback=self.parse_restaurant)

    def parse_restaurant(self, response):
        name = response.xpath('//h1/text()').get()
        website = response.xpath('//a[@rel]/@href').get(default='')
        # Give protocol-relative links (//example.com) an explicit scheme;
        # anchoring with ^ leaves URLs that already have one untouched.
        website = re.sub(r'^//', 'https://', website)
        address = response.xpath('//div[contains(@class, "address")]/span[last()]/text()').get(default='')
        # Fold non-breaking spaces and similar characters, then flatten newlines.
        address = unicodedata.normalize("NFKD", address).replace('\n', ' ').strip()
        yield {
            'name': name,
            "website": website,
            'address': address
        }
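The spider cleans two fields after extraction: it gives the website link a scheme and normalizes the address text. A minimal sketch of those two steps as standalone helpers, which can be checked without running a crawl; the function names `normalize_website` and `clean_address` and the sample strings are hypothetical, not part of the answer.

```python
import unicodedata


def normalize_website(url):
    """Give a protocol-relative URL (//example.com) an explicit https scheme."""
    url = url.strip()
    if url.startswith("//"):
        return "https:" + url
    return url


def clean_address(text):
    """Fold compatibility characters such as non-breaking spaces, then collapse whitespace."""
    normalized = unicodedata.normalize("NFKD", text)
    return " ".join(normalized.split())


# Made-up sample values for illustration only.
print(normalize_website("//example.com/menu"))
print(normalize_website("https://example.com"))
print(clean_address("9 Example St,\n  Fredonia,\xa0NY 14063 "))
```

Checking only the start of the string means a URL that already has a scheme passes through unchanged, and joining on split whitespace removes stray newlines and repeated spaces in one step.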