Here is my Scrapy code ...
import scrapy

class NewsSpider(scrapy.Spider):
    name = "news"
    start_urls = ['http://www.StartURL.com/scrapy/all-news-listing']
    allowed_domains = ["www.xxxxx.com"]

    def parse(self, response):
        for news in response.xpath('head'):
            yield {
                'pagetype': news.xpath('//meta[@name="pdknpagetype"]/@content').extract(),
                'pagetitle': news.xpath('//meta[@name="pdknpagetitle"]/@content').extract(),
                'pageurl': news.xpath('//meta[@name="pdknpageurl"]/@content').extract(),
                'pagedate': news.xpath('//meta[@name="pdknpagedate"]/@content').extract(),
                'pagedescription': news.xpath('//meta[@name="pdknpagedescription"]/@content').extract(),
                'bodytext': [' '.join(item.split()) for item in (response.xpath('//div[@class="module__contentp"]/*/node()/text()').extract())],
            }
        next_page = response.css('p a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
My start_urls page looks like the following. It is a very simple page that lists all 3000 links/URLs I want to crawl:
<html>
<head>
<div>
<p><a href="http://www.xxxxx.com/asdas-sdf/kkm">Page 1</a></p>
<p><a href="http://www.xxxxx.com/vdfvd-asda/vdfvf/dfvd">Page 2</a></p>
<p><a href="http://www.xxxxx.com/oiijo/uoiu/xwswd">Page 3</a></p>
<p><a href="http://www.xxxxx.com/jnkjn-yutyy/hjj-sdf/plm">Page 4</a></p>
<p><a href="http://www.xxxxx.com/unhb-oiiuio/hbhb/jhjh/qwer">Page 5</a></p>
<p><a href="http://www.xxxxx.com/eres/popo-hbhh/oko-sdf/ynyt">Page 6</a></p>
<p><a href="http://www.xxxxx.com/yhbb-ytyu/oioi/rtgb/ttyht">Page 7</a></p>
..........
<p><a href="http://www.xxxxx.com/iojoij/uhuh/page3000">Page 3000</a></p>
</div>
</head>
</html>
When I send Scrapy to this page, it crawls only the first link, i.e. http://www.xxxxx.com/page1, and stops. No errors are reported. It seems the recursion part is not working. How do I modify this code to visit each of these 3000 URLs and fetch some specific fields from each one?
In some similar questions, I saw people use "Rules" and Scrapy's "LinkExtractor" object. I am not sure I need either of these, as my requirements are very simple.
Any help is much appreciated. Thanks.
Each time you request a page like http://www.xxxxx.com/page1, next_page = response.css('p a::attr(href)').extract_first() may return the same result if the page's pagination bar does not change, because extract_first() returns only the first match. If the URLs follow a numbered pattern, there is a better way to do it:
start_urls = ['http://www.xxxxx.com/page{}'.format(i) for i in range(1, last_page + 1)]  # last_page is the last page number
This way, you do not need a callback that follows the next link.
Also, allowed_domains = ["www.xxxxx.com"] is not required in this code; it may be another reason the crawl stops.
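To make the start_urls suggestion concrete, here is a minimal sketch in plain Python. The URL pattern, the page count of 3000, and the helper name build_start_urls are assumptions for illustration only; the approach applies only if the real URLs follow a numbered pattern, which the listing in the question does not guarantee.

```python
# Hypothetical values: the URL pattern and the page count are assumptions.
BASE_URL = "http://www.xxxxx.com/page{}"
LAST_PAGE = 3000


def build_start_urls(base_url, last_page):
    """Return one URL per page number, from 1 through last_page inclusive."""
    return [base_url.format(i) for i in range(1, last_page + 1)]


# In a spider, this list would be assigned to the start_urls class attribute.
start_urls = build_start_urls(BASE_URL, LAST_PAGE)
print(len(start_urls), start_urls[0], start_urls[-1])
```

Note that range(1, last_page + 1) starts at page 1 and includes the last page; range(last_page) alone would start at 0 and stop one page short.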
As I suspected, it was indeed a flaw in the recursion logic.
The following code solved my problem....
import scrapy

class MySpider(scrapy.Spider):  # scrapy.Spider is the current name for the old BaseSpider
    name = "pdknnews"
    start_urls = ['http://www.example.com/scrapy/all-news-listing/']
    allowed_domains = ["example.com"]

    def parse(self, response):
        # Emit one item built from the <meta> tags in <head>.
        for news in response.xpath('head'):
            yield {
                'pagetype': news.xpath('.//meta[@name="pdknpagetype"]/@content').extract(),
                'pagetitle': news.xpath('.//meta[@name="pdknpagetitle"]/@content').extract(),
                'pageurl': news.xpath('.//meta[@name="pdknpageurl"]/@content').extract(),
                'pagedate': news.xpath('.//meta[@name="pdknpagedate"]/@content').extract(),
                'pagedescription': news.xpath('.//meta[@name="pdknpagedescription"]/@content').extract(),
                'bodytext': [' '.join(item.split()) for item in (response.xpath('.//div[@class="module__content"]/*/node()/text()').extract())],
            }
        # Follow every link in the listing, not just the first one.
        for url in response.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
The last 2 lines did the recursion magic ...