Scrapy: XPath for the next page

Question

I'm having trouble getting the XPath for a site's "next page" URL.

The HTML is as follows:

<div class="pagingcont">

        <div class="right margintop" id="save_search_header_popup" style="width:550px;">
            <div class="left marginleft" style="padding-top:1px;">
                <div class="left save_search_env"><img src="/themes/LW1/refresh/images/envelope_icon.gif" alt="Save" />&nbsp;</div>
                <div class="left">
                    Save this search and receive email alerts of new listings
                    &nbsp;<input type="text" maxlength="100" value="Name this search" onfocus="doSavedSearchFocus(this,'Name this search');" style="width:120px;height:14px;color:Gray;"/>&nbsp;
                </div>
            </div>
            <div class="left save_search_btn" style="margin-right:10px;"><img class="pointer" src="/themes/LW1/refresh/images/btn_save.gif" alt="Save"  onclick="showPopup(document.getElementById('save_search_header_popup'), null, 'In order to be notified of new or updated properties, you need to be registered and signed in.');return false;"/></div>
        </div>
        <div class="left margintop marginleft" style="cursor:pointer;height:27px;" onclick="javascript:docompare(true);">
            <div class="left"><img src="//www.landwatch.com/themes/LW1/images/comparebtn_btm.gif" style="margin-bottom:0px;">&nbsp;&nbsp;</div>
            <div class="left active" style="margin-top:4px;">COMPARE</div>
        </div>
        <div class="clear topline"></div>

    <div class="clear margin">
        <b>Page &nbsp;</b>
        &nbsp;<span class="active" style="padding:3px 3px 3px 4px;border:solid 1px black;">1&nbsp;</span>&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=2">2</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=3">3</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=4">4</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=5">5</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=6">6</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=7">7</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=8">8</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=9">9</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=10">10</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=11">11</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=12">12</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=13">13</a>&nbsp;| <a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=2">Next</a>
    </div>

(The href I'm looking for is at the bottom right, which is not easy to see here...)

I have tried the following:

next_page_url = response.xpath("//div[@class='pagingcont']//span//a[text()='Next']/href")
next_page_url = response.urljoin(next_page_url)

for href in response.css('div.propName a::attr(href)'):
    url = response.urljoin(href.extract())
    yield scrapy.Request(url, callback=self.parse_product_page)
yield scrapy.Request(next_page_url, callback=self.parse)

But every time, Scrapy gives me the first page of results and then nothing. So I assume it is not finding the next page. What is wrong with next_page_url?

Answer 1

Your XPath has two problems:

It looks for a <span> that is not in your data.
href is an attribute, not a node, so it should be @href.

Below is a complete working example.

from scrapy.spiders import Spider
from scrapy import Request

class LandSpider(Spider):
    name = 'myspider'
    start_urls = [
        'https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2C&pg=1']

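    def parse(self, response):
        # @href selects the attribute; extract_first() returns a string or None
        next_page_url = response.xpath(
            "//div[@class='pagingcont']//a[text()='Next']/@href").extract_first()

        # follow every listing link on the current results page
        for href in response.css('div.propName a::attr(href)'):
            url = response.urljoin(href.extract())
            yield Request(url, callback=self.parse_product_page)
        # stop when no "Next" link is found (for example on the last page)
        if next_page_url:
            yield Request(response.urljoin(next_page_url), callback=self.parse)

    def parse_product_page(self, response):
        # yield a dict item so the output matches the {"title": ...} results
        yield {'title': response.xpath("//div[@class='detTitle']/text()").extract_first()}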

Result:

[
{"title": "Lulaton, Brantley County, Coast, GA Land For Sale - 936 Acres"},
{"title": "Oglethorpe County, GA Land For Sale - 515 Acres"},
{"title": "Dawsonville, Lumpkin County, GA Land For Sale - 525 Acres"},
{"title": "Wheeler County, GA Land For Sale - 594 Acres"},
{"title": "Cedartown, Polk County, GA Land For Sale - 1185.65 Acres"},
...
]

Answer 2

First, in the HTML sample you posted, no span is an ancestor of the a tag, so //span//a matches nothing. Your XPath should therefore probably be just:

"//div[@class='pagingcont']//a[text()='Next']/@href"

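Of course, it could be better.

Your Python code also never extracts the value, which should be done with .extract_first(). As written, your first next_page_url variable (the first line of the code you shared) is a SelectorList, not a string. Change it to:

next_page_url = response.xpath("//div[@class='pagingcont']//a[text()='Next']/@href").extract_first()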
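
Once next_page_url is a plain string, passing it through response.urljoin is harmless whether the href is absolute or relative. As far as I know, response.urljoin resolves the href against the page URL in much the same way as the standard library's urllib.parse.urljoin, so the behaviour can be sketched without Scrapy. The page URL below is a shortened, hypothetical stand-in for response.url.

```python
from urllib.parse import urljoin

# Hypothetical stand-in for response.url on the first results page.
page_url = "https://www.landwatch.com/default.aspx?pg=1"

# An absolute href (as in the question's HTML) is returned unchanged.
absolute = urljoin(page_url, "https://www.landwatch.com/default.aspx?pg=2")

# A root-relative href is resolved against the page's scheme and host.
relative = urljoin(page_url, "/default.aspx?pg=2")

print(absolute)  # https://www.landwatch.com/default.aspx?pg=2
print(relative)  # https://www.landwatch.com/default.aspx?pg=2
```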


Answer 1
2 Accepted 2017-12-30 00:37:01

Answer 2
1 2017-12-29 22:11:41
