I am new to Scrapy and want to crawl web pages. Before starting the complete project, I was exploring the command line (the Scrapy shell). From the crawled web page I was able to extract the links under the h3 tags with the command below:
sel.xpath("//h3//@href").extract()
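Outside the Scrapy shell, the same h3 link extraction can be sketched with the standard library alone; the sample markup below is made up for illustration, not taken from the actual site:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup standing in for the crawled page
html = """<body>
  <h3><a href="/item/1">First result</a></h3>
  <h3><a href="/item/2">Second result</a></h3>
</body>"""

root = ET.fromstring(html)
# Equivalent idea to //h3//@href: collect href from anchors nested under h3
links = [a.get("href") for h3 in root.iter("h3") for a in h3.iter("a")]
print(links)  # ['/item/1', '/item/2']
```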
That command extracted all the URLs. But I later realised that the links on the website are paginated. I could find the total number of pages by going through them manually, but I thought of extracting it from the first page instead, because the page count appears at the bottom as
Page 1 of 100
inside a div tag:
<div class="pagination-meta">
Page 1 of 100
</div>
I tried the following command to extract the details, but it returned only []. Please correct me if I am wrong:
sel.xpath('//div[@class="pagination_meta"]/text()').extract();
Since the pagination-meta div was nested under two other divs, as shown below, I then tried a different expression:
<div class="search-pagination-top bb box-sizing-content">
<div class="grid_3 column alpha tmargin">
<div class="pagination-meta">
Page 1 of 100
</div>
</div>
</div>
sel.xpath('//div[@class="search-pagination-top bb box-sizing-content"]//div/text()').extract();
[u'Page 1 of 100']
Is this the correct way to do it? Why did my first command not return the expected content?
It will work if you use:
sel.xpath('//div[@class="pagination-meta"]/text()').extract();
Since XPath matches the attribute value as an exact string, the difference between an underscore (pagination_meta) and a dash (pagination-meta) certainly matters: your first command looked for a class that does not exist on the page.
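To see the exact-match behaviour in isolation, here is a minimal sketch using the standard library's ElementTree (Scrapy's Selector behaves the same way for this case); the markup is the snippet from the question:

```python
import xml.etree.ElementTree as ET

html = '<body><div class="pagination-meta">Page 1 of 100</div></body>'
root = ET.fromstring(html)

# Underscore variant: no element carries class="pagination_meta"
assert root.findall(".//div[@class='pagination_meta']") == []

# Dash variant matches the attribute value exactly
meta = root.find(".//div[@class='pagination-meta']")
print(meta.text)  # Page 1 of 100
```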
There are many ways to reach the same result, and your second approach is also correct. It is often necessary to establish a context in one or more location steps, and then navigate with a relative XPath expression to the final selection step. That is useful when the pages, or their structure, may change.
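That two-step idea, context first, then a relative expression, can be sketched with the nested markup from the question (again using the standard library's ElementTree as a stand-in for Scrapy's selectors):

```python
import xml.etree.ElementTree as ET

html = """<body>
<div class="search-pagination-top bb box-sizing-content">
  <div class="grid_3 column alpha tmargin">
    <div class="pagination-meta">Page 1 of 100</div>
  </div>
</div>
</body>"""

root = ET.fromstring(html)
# Step 1: establish context at the outer container
outer = root.find(".//div[@class='search-pagination-top bb box-sizing-content']")
# Step 2: navigate from that context with a relative expression
meta = outer.find(".//div[@class='pagination-meta']")
print(meta.text)  # Page 1 of 100
```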