Making relative paths absolute in python

Question

I want to crawl web page with python, the problem is with relative paths, I have the following functions which normalize and derelativize urls in web page, I can not implement one part of derelativating function. Any ideas? :

def normalizeURL(url):
    if url.startswith('http')==False:
        url = "http://"+url
    if url.startswith('http://www.')==False:
        url = url[:7]+"www."+url[7:]
    return url

def deRelativizePath(url, path):
    url = normalizeURL(url)

    if path.startswith('http'):
        return path
    if path.startswith('/')==False:
        if url.endswith('/'):
            return url+path
        else:
            return url+"/"+path
    else:
        #this part is missing

The problem is: I do not know how to get main url, they can be in many formats:

http://www.example.com
http://www.example.com/
http://www.sub.example.com
http://www.sub.example.com/
http://www.example.com/folder1/file1 #from this I should extract http://www.example.com/ then add path
...

Answer 1

I recommend that you consider using urlparse.urljoin() for this:

Construct a full ("absolute") URL by combining a "base URL" ( base ) with another URL ( url ). Informally, this uses components of the base URL, in particular the addressing scheme, the network location and (part of) the path, to provide missing components in the relative URL.

Answer 2

from urlparse import urlparse

然后解析为各个部分。

Making relative paths absolute in python

Question

2 answers

solution1
4 ACCPTED 2012-05-16 19:32:24

solution2
3 2012-05-16 19:31:44

Making relative paths absolute in python

Question

2 answers

solution1 4 ACCPTED 2012-05-16 19:32:24

solution2 3 2012-05-16 19:31:44

solution1
4 ACCPTED 2012-05-16 19:32:24

solution2
3 2012-05-16 19:31:44