
How can I navigate and extract restaurant details from zaubee.com when the href attribute is set to "#" for each restaurant link?

How can I scrape zaubee.com with Scrapy to extract business details from each restaurant's page when every link's href attribute is set to "#"?

I'm working on a web scraping project that gathers business information from zaubee.com. However, the href attribute I extract for each restaurant link is "#", so I can't visit the individual restaurant pages and collect the data I need.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class zaubeeSpider(scrapy.Spider):
    name = 'zaubeeerestaurant'
    allowed_domains = ['www.zaubee.com']
    start_urls = ['https://zaubee.com/category/restaurant-in-fredonia-hclq6jom']

    def parse(self, response):
        restaurantlink = response.xpath("//div[@class='search-result__title-wrapper']/h2")
        for restaurant in restaurantlink:
            name = restaurant.xpath(".//text()").get()
            link = restaurant.xpath(".//@href").get()
            yield {
                'name': name,
                'link': link
            }
            yield response.follow(url=link, callback=self.parse_restaurant)

    def parse_restaurant(self, response):
        name = response.xpath("//h1[@class='postcard__title postcard__title--claimed']/text()").get()
        website = response.xpath("(//a[@class='profile__website__link']/@href)[1]").get()
        address = response.xpath("(//address[@class='profile__address--compact']/text())[1]").get()

        yield {
            'name': name,
            'website': website,
            'address': address
        }

I've previously created a scraping solution using Scrapy, but I need help overcoming this challenge. What method or workaround can I use to visit each restaurant's page and get the necessary information?

OUTPUT FOR ONE ENTRY:

2023-06-04 23:38:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://zaubee.com/category/restaurant-in-fredonia-hclq6jom>
{'name': 'Restaurants in Fredonia New York', 'link': '#'}

When the spider tries to follow that link, it produces the output below:

2023-06-04 23:38:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://zaubee.com/category/restaurant-in-fredonia-hclq6jom>
{'name': None, 'website': None, 'address': None}

I'm trying to follow each restaurant link and collect the restaurant's name, address, telephone number, and opening hours from that page.
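The second log entry reports the category URL again, which is what you would expect when the followed link is a bare "#". A minimal sketch of that URL resolution, using the standard library's urljoin as a stand-in for the way Scrapy resolves relative links (the /biz/example-restaurant path is a made-up placeholder, not a real zaubee.com path):

```python
from urllib.parse import urljoin

# The category page the spider starts from (taken from the question).
page_url = "https://zaubee.com/category/restaurant-in-fredonia-hclq6jom"

# A bare "#" is an empty fragment, so it resolves to the page itself.
same_page = urljoin(page_url, "#")
print(same_page == page_url)

# Hypothetical relative link, shown for contrast only.
other_page = urljoin(page_url, "/biz/example-restaurant")
print(other_page)
```

So a request built from "#" simply fetches the category page a second time, and the selectors written for a restaurant page find nothing there.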

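The problem is your XPath selectors. Judging by your output, the selector in parse matches the page heading ("Restaurants in Fredonia New York") rather than the individual restaurant entries, so the only href it finds is "#". Select the restaurant entries themselves and take the link from each one: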

import scrapy
import unicodedata
import re


class zaubeeSpider(scrapy.Spider):
    name = 'zaubeeerestaurant'
    start_urls = ['https://zaubee.com/category/restaurant-in-fredonia-hclq6jom']
    allowed_domains = ['zaubee.com']

    def parse(self, response):
        # Select every div with a data-value attribute (used here to pick out the restaurant entries).
        restaurants = response.xpath('//div[@data-value]')
        for restaurant in restaurants:
            # Join the text nodes of the entry's h3 to build the restaurant name.
            name = restaurant.xpath('.//h3/text()[not(span)]').getall()
            name = ''.join(name).strip()
            link = restaurant.xpath(".//a/@href").get(default='')
            yield {
                'name': name,
                'link': response.urljoin(link)
            }
            # Follow only entries that contain a link; an empty href would re-request this page.
            if link:
                yield response.follow(url=link, callback=self.parse_restaurant)

    def parse_restaurant(self, response):
        name = response.xpath('//h1/text()').get()
        website = response.xpath('//a[@rel]/@href').get(default='')
        # Give protocol-relative URLs ("//host/...") an explicit https scheme.
        website = re.sub(r'^//', r'https://', website)
        address = response.xpath('//div[contains(@class, "address")]/span[last()]/text()').get(default='')
        # Fold compatibility characters (such as non-breaking spaces) and flatten newlines.
        address = unicodedata.normalize("NFKD", address).replace('\n', ' ').strip()

        yield {
            'name': name,
            'website': website,
            'address': address
        }
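The clean-up steps in parse_restaurant can be tried on their own, outside the spider. The following is only a sketch: normalize_website and clean_address are hypothetical helper names, the sample inputs are invented, and clean_address collapses all whitespace runs, which is a slight variation on the replace/strip chain above.

```python
import re
import unicodedata


def normalize_website(href: str) -> str:
    """Turn a protocol-relative URL ("//host/path") into an https URL."""
    # The ^ anchor limits the replacement to a leading "//", so URLs that
    # already carry a scheme pass through unchanged.
    return re.sub(r"^//", "https://", href)


def clean_address(raw: str) -> str:
    """Fold compatibility characters and collapse whitespace runs."""
    # NFKD maps a non-breaking space (U+00A0) to an ordinary space.
    text = unicodedata.normalize("NFKD", raw)
    return " ".join(text.split())


# Invented sample values, for illustration only.
print(normalize_website("//example.com/menu"))
print(normalize_website("https://example.com"))
print(clean_address("12 Main St\n\xa0Fredonia, NY "))
```

The anchor matters because an unanchored `//` pattern would also match the `//` inside a URL that already starts with `https://` and corrupt it.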

