Combining base URL with resultant href in Scrapy

Question

Below is my spider code:

import urlparse

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class Blurb2Spider(BaseSpider):
    name = "blurb2"
    allowed_domains = ["www.domain.com"]

    def start_requests(self):
        yield self.make_requests_from_url("http://www.domain.com/bookstore/new")

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        urls = hxs.select('//div[@class="bookListingBookTitle"]/a/@href').extract()
        for i in urls:
            # Join each extracted href with the base URL and request it
            yield Request(urlparse.urljoin('www.domain.com/', i[1:]), callback=self.parse_url)

    def parse_url(self, response):
        hxs = HtmlXPathSelector(response)
        print response, '------->'

Here I am trying to combine the href link with the base link, but I am getting the following error:

exceptions.ValueError: Missing scheme in request url: www.domain.com//bookstore/detail/3271993?alt=Something+I+Had+To+Do

Can anyone tell me why I am getting this error, and how to join the base URL with the href link and yield a request?

Answer 1

An alternative solution, if you don't want to use urlparse:

response.urljoin(i)

This goes a step further: Scrapy takes the base from the response's own URL, so you don't have to spell out the obvious http://www.domain.com yourself. Pass the href unchanged; a leading slash marks it as root-relative, and stripping it with i[1:] would resolve the link against the current page's directory instead.

This keeps your code reusable if you later change the domain you are crawling.
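
To see how the join treats a leading slash, here is a minimal sketch using only the standard library. It assumes Python 3, where Python 2's urlparse.urljoin lives in urllib.parse, and it assumes the extracted hrefs start with a slash; the names page_url and href are illustrative only.

```python
from urllib.parse import urljoin  # Python 3 location of Python 2's urlparse.urljoin

# Assumed inputs: the listing page from the question and a root-relative href.
page_url = "http://www.domain.com/bookstore/new"
href = "/bookstore/detail/3271993?alt=Something+I+Had+To+Do"

# Leading slash kept: the href is resolved from the site root.
kept = urljoin(page_url, href)

# Leading slash stripped: the href is resolved against the page's
# directory (/bookstore/), so the path segment is duplicated.
stripped = urljoin(page_url, href[1:])

print(kept)
print(stripped)
```

With a base that includes a path, passing the href unchanged gives the intended URL, while the stripped variant ends up under /bookstore/bookstore/.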

Answer 2

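It is because you didn't include the scheme, e.g. http://, in your base URL.

Try: urlparse.urljoin('http://www.domain.com/', i[1:])

Or, even easier: urlparse.urljoin(response.url, i), since urlparse.urljoin takes the scheme and host from response.url. Here the href is passed unchanged, because the base now includes a path and a stripped leading slash would resolve against the current page's directory.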
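
A small sketch of why the scheme matters, again assuming Python 3's urllib.parse in place of Python 2's urlparse module. The href value is an assumed example, not taken from the actual site.

```python
from urllib.parse import urljoin, urlparse

# Assumed example of an extracted href.
href = "/bookstore/detail/3271993"

# Base without a scheme: the joined result has no scheme either,
# which is what Scrapy's Request rejects.
no_scheme = urljoin("www.domain.com/", href[1:])

# Base with a scheme: the joined result is a complete absolute URL.
with_scheme = urljoin("http://www.domain.com/", href[1:])

print(repr(urlparse(no_scheme).scheme))
print(repr(urlparse(with_scheme).scheme))
```

urljoin never invents a scheme; it can only carry over the one present in the base, so the base has to be a full URL.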

Answer 3

The best way to follow a link in Scrapy is to use response.follow(); Scrapy will handle the rest.

Quote from the docs:

Unlike scrapy.Request, response.follow supports relative URLs directly - no need to call urljoin.

You can also pass a selector for an <a> element directly as the argument.

3 answers

solution1
18 2017-10-14 15:33:49

solution2
15 ACCEPTED 2012-05-29 12:07:50

solution3
1 2021-02-20 13:07:25
