How can I navigate to and extract restaurant details from zaubee.com when every restaurant link's href attribute is set to "#"?

Question

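How can I crawl zaubee.com with Scrapy and extract business details from each restaurant's page when the href attribute is set to "#"?

I am working on a web scraping project that collects company information from zaubee.com. However, the href attribute of every restaurant link is set to #, so I cannot reach the individual restaurant pages and collect the data I need.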

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class zaubeeSpider(scrapy.Spider):
    name = 'zaubeeerestaurant'
    allowed_domains = ['www.zaubee.com']
    start_urls = ['https://zaubee.com/category/restaurant-in-fredonia-hclq6jom']

    def parse(self, response):
        restaurantlink = response.xpath("//div[@class='search-result__title-wrapper']/h2")
        for restaurant in restaurantlink:
            name = restaurant.xpath(".//text()").get()
            link = restaurant.xpath(".//@href").get()
            yield {
                'name': name,
                'link': link
            }
            yield response.follow(url=link, callback=self.parse_restaurant)

    def parse_restaurant(self, response):
        name = response.xpath("//h1[@class='postcard__title postcard__title--claimed']/text()").get()
        website = response.xpath("(//a[@class='profile__website__link']/@href)[1]").get()
        address = response.xpath("(//address[@class='profile__address--compact']/text())[1]").get()

        yield {
            'name': name,
            "website": website,
            'address': address
        }

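I have built scraping solutions with Scrapy before, but I need help getting past this obstacle. What approach or workaround can I use to reach each restaurant's page and get the information I need?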

Output for one item:

2023-06-04 23:38:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://zaubee.com/category/restaurant-in-fredonia-hclq6jom>
{'name': 'Restaurants in Fredonia New York', 'link': '#'}

When it tries to follow the inner link, the output looks like this:

2023-06-04 23:38:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://zaubee.com/category/restaurant-in-fredonia-hclq6jom>
{'name': None, 'website': None, 'address': None}

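I am trying to follow each restaurant link and collect the restaurant's name, address, phone number, and opening hours from that specific page.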

Answer 1

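The only problem is that your XPath selectors are wrong. The h2 you select is the page heading ('Restaurants in Fredonia New York' in your output), and the href found under it is "#", so the spider never sees the links of the individual restaurants. The spider below selects each result entry and the link inside it instead.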
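As a side note on why the second log entry in the question still shows the category URL: a bare "#" is a fragment-only reference, so resolving it against the page URL gives back the same page. Here is a minimal standard-library sketch of that resolution. The helper name `resolve` and the path `/biz/some-restaurant` are hypothetical and used only for illustration; they are not taken from the site.

```python
from urllib.parse import urldefrag, urljoin

# The category page from the question.
CATEGORY_URL = "https://zaubee.com/category/restaurant-in-fredonia-hclq6jom"


def resolve(base_url: str, href: str) -> str:
    """Resolve href against base_url and drop any fragment.

    The fragment (the part after '#') is never sent to the server,
    so two URLs that differ only in it fetch the same page.
    """
    return urldefrag(urljoin(base_url, href)).url


# A bare "#" points back at the page it was found on.
same_page = resolve(CATEGORY_URL, "#")

# A real relative link (hypothetical path) resolves to a different page.
other_page = resolve(CATEGORY_URL, "/biz/some-restaurant")

if __name__ == "__main__":
    print(same_page == CATEGORY_URL)
    print(other_page)
```

Scrapy's `response.follow` resolves relative links against the response URL in the same way, so following "#" runs `parse_restaurant` on the category page, where the detail selectors match nothing and every field comes back as None.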

import re
import unicodedata

import scrapy


class zaubeeSpider(scrapy.Spider):
    name = 'zaubeeerestaurant'
    start_urls = ['https://zaubee.com/category/restaurant-in-fredonia-hclq6jom']
    allowed_domains = ['zaubee.com']

    def parse(self, response):
        # Select the result entries: divs that carry a data-value attribute.
        restaurants = response.xpath('//div[@data-value]')
        for restaurant in restaurants:
            # Join the text nodes of the h3 to build the restaurant name.
            name = restaurant.xpath('.//h3/text()[not(span)]').getall()
            name = ''.join(name).strip()
            link = restaurant.xpath(".//a/@href").get(default='')
            yield {
                'name': name,
                'link': response.urljoin(link)
            }
            # Skip entries without a link: an empty href points back at this page.
            if link:
                yield response.follow(url=link, callback=self.parse_restaurant)

    def parse_restaurant(self, response):
        name = response.xpath('//h1/text()').get()
        website = response.xpath('//a[@rel]/@href').get(default='')
        # Add a scheme only to a leading '//'; an unanchored pattern would
        # also rewrite the '//' inside an already absolute URL.
        website = re.sub(r'^//', 'https://', website)
        address = response.xpath('//div[contains(@class, "address")]/span[last()]/text()').get(default='')
        # Normalize compatibility characters (e.g. non-breaking spaces) and flatten newlines.
        address = unicodedata.normalize("NFKD", address).replace('\n', ' ').strip()

        yield {
            'name': name,
            "website": website,
            'address': address
        }
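The two clean-up steps in `parse_restaurant` can also be tried outside Scrapy. The sketch below is one possible variant, not the answer's exact code. It assumes the scraped href may be protocol-relative (as the `//` substitution in the spider suggests) and uses `urllib.parse.urljoin` instead of a regular expression. The function names and sample values are hypothetical.

```python
import unicodedata
from urllib.parse import urljoin


def normalize_website(page_url: str, href: str) -> str:
    """Turn a scraped href into an absolute URL; an empty href stays empty.

    urljoin gives a protocol-relative href ('//host/path') the scheme of
    page_url and leaves an already absolute URL unchanged.
    """
    if not href:
        return ""
    return urljoin(page_url, href)


def clean_address(raw: str) -> str:
    """Apply NFKD normalization, then collapse whitespace runs to single spaces."""
    return " ".join(unicodedata.normalize("NFKD", raw).split())


if __name__ == "__main__":
    # Hypothetical detail-page URL and scraped values.
    page = "https://zaubee.com/biz/some-restaurant"
    print(normalize_website(page, "//example.com/menu"))
    print(normalize_website(page, "https://example.com"))
    print(clean_address("12 Main St\n\u00a0Fredonia, NY "))
```

Inside the spider, `response.urljoin(website)` would do the same resolution against the response URL.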

Solution 1
0 2023-06-06 16:28:39
