Scrapy not crawling through links

Question

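I am crawling with Scrapy using a link extractor. I believe the XPath expression in my link extractor is correct, but I don't know why the spider runs indefinitely and prints some kind of source code instead of the restaurant details. I know there is a mistake somewhere in my restrict_xpaths expression, but I can't figure out what it is.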

Code:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TripadSpider(CrawlSpider):
    name = 'tripad'
    allowed_domains = ['www.tripadvisor.in']
    start_urls = ['https://www.tripadvisor.in/Restaurants-g304551-New_Delhi_National_Capital_Territory_of_Delhi.html']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@class="OhCyu"]//a'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'title': response.xpath('//h1[@class="fHibz"]/text()').get(),
            'Address': response.xpath('(//a[@class="fhGHT"])[2]').get()
        }

Answer 1

It is crawling; try changing your user agent. However, you forgot to add /text() to the Address XPath.
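To see why the missing /text() makes the output look like source code, here is a minimal sketch that uses the standard library's ElementTree as a stand-in for Scrapy's selector (an assumption made so the snippet runs without Scrapy installed). The sample markup is hypothetical; only the class name is taken from the spider. Selecting the element itself yields its markup, while selecting its text yields just the address string.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; only the class name mirrors the one used in the spider.
SAMPLE = """
<div>
  <a class="fhGHT" href="#first">First link</a>
  <a class="fhGHT" href="#second">Plot 4, Dwarka City Centre, New Delhi</a>
</div>
""".strip()

root = ET.fromstring(SAMPLE)

# Rough equivalent of (//a[@class="fhGHT"])[2]: the second matching link.
second_link = root.findall(".//a[@class='fhGHT']")[1]

# Without /text(): the whole element, tags and attributes included.
element_markup = ET.tostring(second_link, encoding="unicode").strip()

# With /text(): only the text inside the element.
element_text = second_link.text

print(element_markup)
print(element_text)
```

In the spider, the corresponding fix is to append /text() to the Address XPath, as in the corrected code that follows.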

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TripadSpider(CrawlSpider):
    name = 'tripad'
    allowed_domains = ['tripadvisor.in']
    start_urls = ['https://www.tripadvisor.in/Restaurants-g304551-New_Delhi_National_Capital_Territory_of_Delhi.html']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@class="OhCyu"]//a'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # from scrapy.shell import inspect_response
        # inspect_response(response, self)
        yield {
            'title': response.xpath('//h1[@class="fHibz"]/text()').get(),
            'Address': response.xpath('(//a[@class="fhGHT"])[2]/text()').get()
        }

Output:

{'title': 'Mosaic', 'Address': 'Sector 10 Lobby Level Crowne Plaza Twin District Centre, Rohini, New Delhi 110085 India'}
{'title': 'Spring', 'Address': 'Plot 4, Dwarka City Centre Radisson Blu, Sector 13, New Delhi 110075 India'}
{'title': 'Dilli 32', 'Address': 'Maharaja Surajmal Road The Leela Ambience Convention Hotel, Near Yamuna Sports Complex, Vivek Vihar, New Delhi 110002 India'}
{'title': 'Viva - All Day Dining', 'Address': 'Hospitality District Asset Area 12 Gurgoan sector 28, New Delhi 110037 India'}
...
...
...
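The answer suggests changing the user agent but does not show how. A minimal sketch follows, assuming a browser-like string is what the site expects; the exact string is an illustrative assumption, and whether the site accepts it is not guaranteed. USER_AGENT is a standard Scrapy setting, and custom_settings is the spider class attribute that overrides settings for a single spider.

```python
# Assumed browser-like User-Agent string, for illustration only.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/96.0.4664.110 Safari/537.36"
)

# Option 1: project-wide, as a line in the project's settings.py.
USER_AGENT = BROWSER_USER_AGENT

# Option 2: per spider, as a class attribute inside TripadSpider.
# It is shown at module level here only so the snippet runs on its own.
custom_settings = {"USER_AGENT": BROWSER_USER_AGENT}
```

Only one of the two options is needed. The per-spider form is convenient when other spiders in the same project should keep the default user agent.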

Answer 1: 0 accepted, 2021-12-14 10:14:09