
Scrapy does not return text

I am new to Scrapy and want to crawl web pages. Before starting a complete project, I was exploring the command line. From the crawled page I was able to extract the links under the h3 tags with the command below:

sel.xpath("//h3//@href").extract()

This extracted all the URLs. Later I realised that the links on the website are paginated. I was able to find the total number of pages by going through the pages manually, but I thought of extracting it from the first page, because it has this information at the bottom:

Page 1 of 100

under a div tag

<div class="pagination-meta">
    Page 1 of 100
</div>

I tried the following command to extract the details, but it returned only []. Please correct me if I am wrong.

sel.xpath('//div[@class="pagination_meta"]/text()').extract();

I then tried the following, since the pagination-meta div is nested inside two other divs:

<div class="search-pagination-top bb box-sizing-content">
    <div class="grid_3 column alpha tmargin">
        <div class="pagination-meta">
        Page 1 of 100
        </div>
    </div>
</div>


sel.xpath('//div[@class="search-pagination-top bb box-sizing-content"]//div/text()').extract();
    [u'Page 1 of 100']

Is this the correct way to do it? Why did my first command not return the content?

It will work if you use:

sel.xpath('//div[@class="pagination-meta"]/text()').extract();

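Since the predicate matches the exact attribute string, the difference between an underscore and a dash matters: your first command looked for pagination_meta, but the class on the page is pagination-meta.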
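
A minimal sketch of that difference. It uses Python's standard-library xml.etree.ElementTree as a stand-in for the Scrapy selector, which is an assumption made only so the snippet runs without Scrapy installed. ElementTree supports just a subset of XPath (no text() step, for example), but the [@class='...'] predicate compares the full attribute string in the same way:

```python
import xml.etree.ElementTree as ET

# Markup copied from the question.
html = """
<div class="search-pagination-top bb box-sizing-content">
    <div class="grid_3 column alpha tmargin">
        <div class="pagination-meta">
        Page 1 of 100
        </div>
    </div>
</div>
"""
root = ET.fromstring(html.strip())

# Underscore: no element has class="pagination_meta", so nothing matches.
underscore_matches = root.findall(".//div[@class='pagination_meta']")

# Dash: matches the real class attribute.
dash_matches = root.findall(".//div[@class='pagination-meta']")
dash_text = [el.text.strip() for el in dash_matches]

print(underscore_matches)  # empty list
print(dash_text)
```

The empty list in the first case is not an error. It only means the predicate matched nothing.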

There are many ways to reach the same result, and your second approach is also correct. It is often necessary to establish a context in one or more location steps and then navigate from there with a relative XPath expression to the final selection step. This is useful when the content or the structure of the page may change.
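
As a rough sketch of the context-then-relative-step idea, the snippet below first selects the outer container and then searches relative to it, and finally pulls the page numbers out of the text with a regular expression. Again xml.etree.ElementTree stands in for the Scrapy selector. The body wrapper element and the variable names are assumptions added for illustration and are not part of the original page:

```python
import re
import xml.etree.ElementTree as ET

# Markup from the question, wrapped in a hypothetical <body> element
# so the outer div can be found as a descendant of the root.
html = """
<body>
<div class="search-pagination-top bb box-sizing-content">
    <div class="grid_3 column alpha tmargin">
        <div class="pagination-meta">
        Page 1 of 100
        </div>
    </div>
</div>
</body>
"""
root = ET.fromstring(html.strip())

# Step 1: establish the context (the outer pagination container).
context = root.find(".//div[@class='search-pagination-top bb box-sizing-content']")

# Step 2: relative expression, evaluated from the context element.
meta = context.find(".//div[@class='pagination-meta']")
text = meta.text.strip()

# Step 3: parse "Page X of Y" into integers.
match = re.search(r"Page\s+(\d+)\s+of\s+(\d+)", text)
current_page, total_pages = (int(g) for g in match.groups())

print(current_page, total_pages)
```

With the total page count available as an integer, the remaining pages could be requested in a loop instead of being counted by hand.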

