
Scrapy LxmlLinkExtractor and relative URLs

The correct URL my Rule should end up at is: http://www.lecture-en-ligne.com/towerofgod/168/0/0/1.html

Scrapy extracts the relative URL from the source correctly:

<a class="table" href="../../towerofgod/168/0/0/1.html">Lire en ligne</a>

but it then crawls the wrong address, treating the leading ../.. as literal path segments of the next URL to fetch...

Should I transform the doubly relative URL I get from the LxmlLinkExtractor with a custom process_value callback?

Is Scrapy handling relative URLs correctly here, i.e. is this the intended behaviour?

2014-12-06 17:20:05+0100 [togspider] DEBUG: Crawled (200) <GET http://www.lecture-en-ligne.com/manga/towerofgod/> (referer: None)

2014-12-06 17:20:05+0100 [togspider] DEBUG: Retrying <GET http://www.lecture-en-ligne.com/../../towerofgod/160/0/0/1.html> (failed 1 times): 400 Bad Request

# Imports added for completeness (Scrapy 0.24-era paths; in Scrapy 1.0+ they
# live under scrapy.spiders and scrapy.linkextractors):
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor


class TogSpider(CrawlSpider):
    name = "togspider"
    allowed_domains = ["lecture-en-ligne.com"]
    start_urls = ["http://www.lecture-en-ligne.com/manga/towerofgod/"]

    # Extract the chapter link matched by the XPath and hand the response
    # to parse_chapter.
    rules = (
        Rule(LxmlLinkExtractor(allow_domains=allowed_domains,
                               restrict_xpaths='.//*[@id="page"]/table[2]/tbody/tr[10]/td[2]/a'),
             callback='parse_chapter'),
    )
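For the record, the process_value idea mentioned above could look roughly like the sketch below. clean_parent_dirs is a hypothetical helper, and it assumes the value passed to process_value is the href already joined against the page's <base> URL, which is what the malformed URL in the retry log suggests:

from posixpath import normpath
from urlparse import urlsplit, urlunsplit  # urllib.parse on Python 3

def clean_parent_dirs(value):
    # Collapse the "/../.." left over after the relative href was joined
    # against the site root, so the request path becomes
    # /towerofgod/168/0/0/1.html instead of /../../towerofgod/168/0/0/1.html.
    parts = urlsplit(value)
    return urlunsplit(parts._replace(path=normpath(parts.path)))

rules = (
    Rule(LxmlLinkExtractor(allow_domains=allowed_domains,
                           restrict_xpaths='.//*[@id="page"]/table[2]/tbody/tr[10]/td[2]/a',
                           process_value=clean_parent_dirs),
         callback='parse_chapter'),
)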

The problem is that the HTML contains an incorrect base element, which is supposed to specify the base URL for all relative links on the page:

<base href="http://www.lecture-en-ligne.com/"/>

Scrapy respects that base URL, which is why the links are being formed that way.
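To make the effect concrete, here is a quick comparison using the standard library's urljoin. This is only an illustration of base-URL resolution, not exactly what Scrapy builds internally; the retry log above shows Scrapy keeping the ../.. rather than dropping it:

from urlparse import urljoin  # urllib.parse on Python 3

href = "../../towerofgod/168/0/0/1.html"

# Resolved against the actual page URL, the href points at the expected chapter:
urljoin("http://www.lecture-en-ligne.com/manga/towerofgod/", href)
# -> 'http://www.lecture-en-ligne.com/towerofgod/168/0/0/1.html'

# Resolved against the declared <base href="http://www.lecture-en-ligne.com/"/>,
# the "../.." has no directories left to climb, which is how the crawl ends up
# requesting the malformed /../../towerofgod/... URL from the log.

So either the page's base element would have to match the real page location, or the spider has to correct the URLs itself (e.g. with a process_value hook like the one sketched above).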
