
Scrapy is not crawling through the links

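I am crawling with Scrapy using a link extractor. As far as I can tell, the XPath expression in the link extractor is correct, but the spider runs indefinitely and prints what looks like the page source for some restaurant names. I know there is a mistake somewhere in my restrict_xpaths expression, but I can't figure out what it is.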

Code:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TripadSpider(CrawlSpider):
    name = 'tripad'
    allowed_domains = ['www.tripadvisor.in']
    start_urls = ['https://www.tripadvisor.in/Restaurants-g304551-New_Delhi_National_Capital_Territory_of_Delhi.html']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@class="OhCyu"]//a'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'title': response.xpath('//h1[@class="fHibz"]/text()').get(),
            'Address': response.xpath('(//a[@class="fhGHT"])[2]').get()
        }

It is crawling; try changing your user_agent. However, you forgot to add /text() to the Address XPath.
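To illustrate why the missing /text() matters: selecting an element returns its serialized markup, while selecting its text node returns only the text. The sketch below shows that distinction with the standard library's ElementTree rather than Scrapy's own selectors, and the sample markup is hypothetical, not taken from the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for two links that share the same class.
SAMPLE = (
    '<div>'
    '<a class="fhGHT">4.5 of 5 bubbles</a>'
    '<a class="fhGHT">Plot 4, Dwarka City Centre</a>'
    '</div>'
)

root = ET.fromstring(SAMPLE)

# Second <a class="fhGHT"> in document order, similar in spirit to
# (//a[@class="fhGHT"])[2] in the spider.
second_link = root.findall('.//a[@class="fhGHT"]')[1]

# Selecting the element itself gives its serialized markup...
as_markup = ET.tostring(second_link, encoding="unicode")
# ...while selecting its text gives only the text content.
as_text = second_link.text

print(as_markup)
print(as_text)
```

In the spider the same distinction applies: without /text(), the 'Address' field holds the whole <a> element as a string instead of the address itself.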

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TripadSpider(CrawlSpider):
    name = 'tripad'
    allowed_domains = ['tripadvisor.in']
    start_urls = ['https://www.tripadvisor.in/Restaurants-g304551-New_Delhi_National_Capital_Territory_of_Delhi.html']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@class="OhCyu"]//a'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # from scrapy.shell import inspect_response
        # inspect_response(response, self)
        yield {
            'title': response.xpath('//h1[@class="fHibz"]/text()').get(),
            'Address': response.xpath('(//a[@class="fhGHT"])[2]/text()').get()
        }

Output:

{'title': 'Mosaic', 'Address': 'Sector 10 Lobby Level Crowne Plaza Twin District Centre, Rohini, New Delhi 110085 India'}
{'title': 'Spring', 'Address': 'Plot 4, Dwarka City Centre Radisson Blu, Sector 13, New Delhi 110075 India'}
{'title': 'Dilli 32', 'Address': 'Maharaja Surajmal Road The Leela Ambience Convention Hotel, Near Yamuna Sports Complex, Vivek Vihar, New Delhi 110002 India'}
{'title': 'Viva - All Day Dining', 'Address': 'Hospitality District Asset Area 12 Gurgoan sector 28, New Delhi 110037 India'}
...
...
...
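
The suggestion to change the user agent can be applied through Scrapy's USER_AGENT setting. A minimal sketch of a settings.py fragment follows; the browser string is only an illustrative placeholder, not a value from the answer, and whether it is enough to avoid being blocked by this site is an assumption.

```python
# settings.py (fragment)
# USER_AGENT is a standard Scrapy setting. The value below is an illustrative
# browser-like string (an assumption), not one taken from the original answer.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)
```

To limit the change to this spider, the same key can instead be placed in a custom_settings dict on the TripadSpider class.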

