Scraping a web page containing an anchor tag <a href="#"> using Scrapy
I am scraping Manulife's job search page.
I want to go to the next page. When I inspect the "Next" link, I get:
<span class="pagerlink">
<a href="#" id="next" title="Go to the next page">Next</a>
</span>
What would be the right approach to follow?
# -*- coding: utf-8 -*-
import scrapy
import json
from scrapy_splash import SplashRequest


class Manulife(scrapy.Spider):
    name = 'manulife'
    #allowed_domains = ['https://manulife.taleo.net/careersection/external_global/jobsearch.ftl?lang=en']
    start_urls = ['https://manulife.taleo.net/careersection/external_global/jobsearch.ftl?lang=en&location=1038']

    def start_requests(self):
        # Render each start page through Splash so the JavaScript-built listing is present.
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                args={'wait': 5},
            )

    def parse(self, response):
        #yield {
        #    'demo' : response.css('div.absolute > span > a::text').extract()
        #    }
        # Follow every job link found on the listing page.
        urls = response.css('div.absolute > span > a::attr(href)').extract()
        for url in urls:
            url = "https://manulife.taleo.net" + url
            yield SplashRequest(url = url, callback = self.parse_details, args={'wait': 5})
            #self.log("reached 22 : "+ url)
        #hitting next button
        #data = json.loads(response.text)
        #self.log("reached 22 : "+ data)
        # Not worked out yet: the "Next" link's href is only "#", so there is no URL to follow.
        next_page_url = None
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield SplashRequest(url = next_page_url, callback = self.parse, args={'wait': 5})

    def parse_details(self, response):
        # Extract the fields of a single job posting.
        yield {
            'Job post' : response.css('div.contentlinepanel > span.titlepage::text').extract(),
            'Location' : response.xpath("//span[@id = 'requisitionDescriptionInterface.ID1679.row1']/text()").extract(),
            'Organization' : response.xpath("//span[@id = 'requisitionDescriptionInterface.ID1787.row1']/text()").extract(),
            'Date posted' : response.xpath("//span[@id = 'requisitionDescriptionInterface.reqPostingDate.row1']/text()").extract(),
            'Industry': response.xpath("//span[@id = 'requisitionDescriptionInterface.ID1951.row1']/text()").extract()
        }
As you can see, the code uses SplashRequest when following the next-page link.
I am a novice at scraping. I read somewhere that a website can also return its response as JSON. I tried that, but it gives me the error "No JSON object could be decoded".
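One plausible cause of that JSON error, sketched under the assumption that the response body is the HTML rendered by Splash rather than a JSON document: `json.loads` raises `ValueError` on any non-JSON text. The helper name `try_parse_json` and both sample bodies below are hypothetical and exist only for illustration.

```python
import json


def try_parse_json(body):
    """Return the decoded object, or None when the body is not JSON."""
    try:
        return json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


# An HTML fragment like the pager in the question is not valid JSON.
html_body = '<span class="pagerlink"><a href="#" id="next">Next</a></span>'
# Hypothetical JSON payload, not taken from the real site.
json_body = '{"jobs": []}'

print(try_parse_json(html_body))  # None
print(try_parse_json(json_body))  # {'jobs': []}
```

If the site does expose a JSON endpoint, it would be a different URL from the rendered listing page; the question does not say which URL was tried.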
I think using the CSS selector ".pagerlink a[title='Go to the next page']" could work.
But ".pagerlink:last-child a" would be the best approach, in my opinion. You just have to get the href attribute.
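To see what that lookup yields on the markup from the question, here is a minimal sketch that uses only the Python standard library as a stand-in for Scrapy's selectors. The class `NextLinkFinder` and the function `find_next_href` are hypothetical helpers, and the matching is deliberately simplified (it does not handle nested spans).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Pager markup copied from the question.
PAGER_HTML = """
<span class="pagerlink">
<a href="#" id="next" title="Go to the next page">Next</a>
</span>
"""

# Listing URL from the spider's start_urls.
LISTING_URL = ("https://manulife.taleo.net/careersection/external_global/"
               "jobsearch.ftl?lang=en&location=1038")


class NextLinkFinder(HTMLParser):
    """Hypothetical helper: remembers the href of the "next page" anchor
    found inside a <span class="pagerlink">."""

    def __init__(self):
        super().__init__()
        self.in_pager = False
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "pagerlink" in classes:
            self.in_pager = True
        elif (tag == "a" and self.in_pager
              and attrs.get("title") == "Go to the next page"):
            self.next_href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_pager = False


def find_next_href(html):
    """Return the href of the next-page link, or None if it is absent."""
    finder = NextLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.next_href


href = find_next_href(PAGER_HTML)
print(href)  # the raw attribute value
# Resolving the href against the listing URL, as response.urljoin would.
print(urljoin(LISTING_URL, href))
```

In a Scrapy callback the same lookup would be `response.css(".pagerlink a[title='Go to the next page']::attr(href)").extract_first()`. For the markup shown in the question the attribute value is "#", so resolving it points back at the listing page itself; the page change is presumably driven by JavaScript attached to the link.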