
Scrapy LxmlLinkExtractor and relative URLs

The correct URL my Rule should end up at is: http://www.lecture-en-ligne.com/towerofgod/168/0/0/1.html

Scrapy extracts the relative URL from the source correctly:

<a class="table" href="../../towerofgod/168/0/0/1.html">Lire en ligne</a>

but it then crawls the wrong address, treating the leading ../.. as literal path segments of the next URL to fetch...

Should I transform the doubly relative URL I get from the LxmlLinkExtractor with a custom process_value callback?

Is Scrapy handling relative URLs correctly here, i.e. is this the intended behaviour?

2014-12-06 17:20:05+0100 [togspider] DEBUG: Crawled (200) <GET http://www.lecture-en-ligne.com/manga/towerofgod/> (referer: None)

2014-12-06 17:20:05+0100 [togspider] DEBUG: Retrying <GET http://www.lecture-en-ligne.com/../../towerofgod/160/0/0/1.html> (failed 1 times): 400 Bad Request

# Imports added for completeness (Scrapy 0.24-era paths; in Scrapy 1.0+ they
# live under scrapy.spiders and scrapy.linkextractors):
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor


class TogSpider(CrawlSpider):
    name = "togspider"
    allowed_domains = ["lecture-en-ligne.com"]
    start_urls = ["http://www.lecture-en-ligne.com/manga/towerofgod/"]

    # Extract the chapter link matched by the XPath and hand the response
    # to parse_chapter.
    rules = (
        Rule(LxmlLinkExtractor(allow_domains=allowed_domains,
                               restrict_xpaths='.//*[@id="page"]/table[2]/tbody/tr[10]/td[2]/a'),
             callback='parse_chapter'),
    )
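For the record, the process_value idea mentioned above could look roughly like the sketch below. clean_parent_dirs is a hypothetical helper, and it assumes the value passed to process_value is the href already joined against the page's <base> URL, which is what the malformed URL in the retry log suggests:

from posixpath import normpath
from urlparse import urlsplit, urlunsplit  # urllib.parse on Python 3

def clean_parent_dirs(value):
    # Collapse the "/../.." left over after the relative href was joined
    # against the site root, so the request path becomes
    # /towerofgod/168/0/0/1.html instead of /../../towerofgod/168/0/0/1.html.
    parts = urlsplit(value)
    return urlunsplit(parts._replace(path=normpath(parts.path)))

rules = (
    Rule(LxmlLinkExtractor(allow_domains=allowed_domains,
                           restrict_xpaths='.//*[@id="page"]/table[2]/tbody/tr[10]/td[2]/a',
                           process_value=clean_parent_dirs),
         callback='parse_chapter'),
)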

The problem is that the HTML contains an incorrect base element, which is supposed to specify the base URL for all relative links on the page:

<base href="http://www.lecture-en-ligne.com/"/>

Scrapy respects that base URL, which is why the links are being formed that way.
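To make the effect concrete, here is a quick comparison using the standard library's urljoin. This is only an illustration of base-URL resolution, not exactly what Scrapy builds internally; the retry log above shows Scrapy keeping the ../.. rather than dropping it:

from urlparse import urljoin  # urllib.parse on Python 3

href = "../../towerofgod/168/0/0/1.html"

# Resolved against the actual page URL, the href points at the expected chapter:
urljoin("http://www.lecture-en-ligne.com/manga/towerofgod/", href)
# -> 'http://www.lecture-en-ligne.com/towerofgod/168/0/0/1.html'

# Resolved against the declared <base href="http://www.lecture-en-ligne.com/"/>,
# the "../.." has no directories left to climb, which is how the crawl ends up
# requesting the malformed /../../towerofgod/... URL from the log.

So either the page's base element would have to match the real page location, or the spider has to correct the URLs itself (e.g. with a process_value hook like the one sketched above).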
