The correct URL my Rule should end up with is: http://www.lecture-en-ligne.com/towerofgod/168/0/0/1.html
Scrapy extracts the relative URL correctly from the source:
<a class="table" href="../../towerofgod/168/0/0/1.html">Lire en ligne</a>
but it then requests the wrong address, treating the leading ../../ as a literal part of the next URL to fetch.
Should I transform the relative URL I get from the LxmlLinkExtractor with a custom process_value?
Is Scrapy handling relative URLs correctly here, i.e. is this the intended behaviour?
2014-12-06 17:20:05+0100 [togspider] DEBUG: Crawled (200) <GET http://www.lecture-en-ligne.com/manga/towerofgod/> (referer: None)
2014-12-06 17:20:05+0100 [togspider] DEBUG: Retrying <GET http://www.lecture-en-ligne.com/../../towerofgod/160/0/0/1.html> (failed 1 times): 400 Bad Request
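# Import paths for current Scrapy; older releases used scrapy.contrib.* instead.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor


class TogSpider(CrawlSpider):
    name = "togspider"
    allowed_domains = ["lecture-en-ligne.com"]
    start_urls = ["http://www.lecture-en-ligne.com/manga/towerofgod/"]
    rules = (
        # Follow the "Lire en ligne" link and pass the response to parse_chapter.
        Rule(LxmlLinkExtractor(allow_domains=allowed_domains,
                               restrict_xpaths='.//*[@id="page"]/table[2]/tbody/tr[10]/td[2]/a'),
             callback='parse_chapter'),
    )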
The problem is that the page has an incorrect HTML base element, which is supposed to specify the base URL for all the relative links in the page:
<base href="http://www.lecture-en-ligne.com/"/>
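Scrapy respects that element, which is why the links are formed that way: resolved against the site root, the ../../ segments have no parent directory to climb to, so they end up in the requested URL, as the log shows.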
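One possible workaround, shown here only as a minimal sketch, is to resolve the href against the URL of the page itself rather than against the declared base. The helper name resolve_against_page and the hard-coded PAGE_URL are assumptions made for illustration; only the two URLs come from the question. The sketch uses the standard library alone so it can be checked without Scrapy installed.

```python
from urllib.parse import urljoin

# Values copied from the question.
PAGE_URL = "http://www.lecture-en-ligne.com/manga/towerofgod/"
RELATIVE_HREF = "../../towerofgod/168/0/0/1.html"


def resolve_against_page(value, page_url=PAGE_URL):
    """Hypothetical helper: make an href absolute using the page URL,
    ignoring whatever the page's <base> element declares."""
    return urljoin(page_url, value)


if __name__ == "__main__":
    # Resolving from /manga/towerofgod/ lets ../../ climb back to the site root.
    print(resolve_against_page(RELATIVE_HREF))
```

A function like this could be passed as the process_value argument of the link extractor. Since it returns an absolute URL, the later join with the base should leave it unchanged. Hard-coding the page URL only suits a spider that extracts links from a single listing page, as this one does.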