How to extract the website URL from the redirect link with Scrapy Python

I wrote a script to get data from a website. I'm having trouble collecting the website URL because the @href is a redirect link. How can I convert the redirect URL to the actual website it redirects to?

import scrapy
import logging


class AppSpider(scrapy.Spider):
    name = 'app'
    allowed_domains = ['www.houzz.in']
    start_urls = ['https://www.houzz.in/professionals/searchDirectory?topicId=26721&query=Design-Build+Firms&location=Mumbai+City+District%2C+India&distance=100&sort=4']

    def parse(self, response):
        lists = response.xpath('//li[@class="hz-pro-search-results__item"]/div/div[@class="hz-pro-search-result__info"]/div/div/div/a')
        for data in lists:
            link = data.xpath('.//@href').get()

            yield scrapy.Request(url=link, callback=self.parse_houses, meta={'Links': link})

        next_page = response.xpath('(//a[@class="hz-pagination-link hz-pagination-link--next"])[1]/@href').extract_first()
        if next_page:
            yield response.follow(response.urljoin(next_page), callback=self.parse)

    def parse_houses(self, response):
        link = response.request.meta['Links']

        firm_name = response.xpath('//div[@class="hz-profile-header__title"]/h1/text()').get()
        name = response.xpath('//div[@class="profile-meta__val"]/text()').get()
        phone = response.xpath('//div[@class="hz-profile-header__contact-info text-right mrm"]/a/span/text()').get()
        website = response.xpath('(//div[@class="hz-profile-header__contact-info text-right mrm"]/a)[2]/@href').get()

        yield {
            'Links': link,
            'Firm_name': firm_name,
            'Name': name,
            'Phone': phone,
            'Website': website
        }

You have to make a request to that target URL to see where it leads.

In your case, a HEAD request is enough: it does not download the body of the target URL, which saves bandwidth and speeds up your script.

def parse_houses(self, response):
    link = response.request.meta['Links']

    firm_name = response.xpath('//div[@class="hz-profile-header__title"]/h1/text()').get()
    name = response.xpath('//div[@class="profile-meta__val"]/text()').get()
    phone = response.xpath('//div[@class="hz-profile-header__contact-info text-right mrm"]/a/span/text()').get()
    website = response.xpath('(//div[@class="hz-profile-header__contact-info text-right mrm"]/a)[2]/@href').get()

    # Carry the scraped fields along with the follow-up request
    item = {
        'Links': link,
        'Firm_name': firm_name,
        'Name': name,
        'Phone': phone,
        'Website': website
    }

    # HEAD fetches only the headers of the redirect link, not the body
    yield scrapy.Request(
        url=website,
        method="HEAD",
        callback=self.get_final_link,
        meta={'data': item},
    )


def get_final_link(self, response):
    data = response.meta['data']
    # Scrapy follows redirects by default, so response.url is the URL the
    # request finally landed on. A Location header is only present on the
    # intermediate 3xx response, which this callback does not receive.
    data['Website'] = response.url
    yield data
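
If you would rather read the Location header yourself (for example with redirects disabled through Scrapy's `dont_redirect` request meta key), keep in mind that the header value arrives as bytes and may be a relative path. The following is only a minimal sketch using the standard library; `resolve_location` is a hypothetical helper name, not part of Scrapy, and the URLs in the usage note are made up.

```python
from urllib.parse import urljoin


def resolve_location(request_url, location):
    """Return the absolute redirect target for a Location header value.

    request_url: the URL that was requested (str).
    location: the raw Location header value (bytes, str, or None).
    """
    if location is None:
        # No redirect was reported, so the requested URL is the final one
        return request_url
    if isinstance(location, bytes):
        # latin-1 maps every byte to a character, so decoding cannot fail
        location = location.decode('latin-1')
    # urljoin leaves absolute targets untouched and resolves relative ones
    return urljoin(request_url, location)
```

In a callback this could be called as `resolve_location(response.url, response.headers.get('Location'))`. An absolute target such as `b'https://example.com/'` is returned unchanged, while a relative one such as `b'/pro/x'` is resolved against the requested URL.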

If your goal is just to get the website, the actual website link is also available in the source code of each listing. You can grab it with a regex, with no need to visit the encrypted URL.

import re  # at the top of the spider module

def parse_houses(self, response):
    link = response.request.meta['Links']

    firm_name = response.xpath('//div[@class="hz-profile-header__title"]/h1/text()').get()
    name = response.xpath('//div[@class="profile-meta__val"]/text()').get()
    phone = response.xpath('//div[@class="hz-profile-header__contact-info text-right mrm"]/a/span/text()').get()
    # The real site URL appears in the page source as "url": "..."
    match = re.search(r'"url": "(.*?)"', response.text)
    website = match.group(1) if match else None  # None when no match is found
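
The regex approach can be tried outside the spider on a small string first. This is only a sketch: `extract_website` is a hypothetical helper, the sample markup is invented, and the real page may format the `"url"` field differently (for instance with escaped slashes), so check it against the actual source.

```python
import re

# Matches "url": "..." and tolerates optional whitespace around the colon
URL_FIELD = re.compile(r'"url"\s*:\s*"(.*?)"')


def extract_website(page_source):
    """Return the first "url" value found in page_source, or None."""
    match = URL_FIELD.search(page_source)
    return match.group(1) if match else None
```

Returning `None` instead of indexing into `re.findall(...)[0]` keeps a listing without a website from raising `IndexError` and stopping the callback.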

You can do something like this:

class AppSpider(scrapy.Spider):
    base_url = 'www.houzz.in{}'
    .
    .
    .
    def foo(self, redirect_url):
        # redirect_url is the href scraped from the page
        actual_url = self.base_url.format(redirect_url)
