urlparse.urljoin() not handling invalid parent directories

Question

Is there a way to account for "invalid" parent directories when constructing an absolute URL from a relative one, or should I just use .replace()?

>>> from urlparse import urljoin
>>> url = urljoin('http://www.example.com/path/', '../../../index.html')
>>> url
'http://www.example.com/../../index.html'
>>> url.replace('../', '')
'http://www.example.com/index.html'

Better yet, is there a cleaner way to sanitize URLs when scraping in Python?

Answer 1

As you said, it doesn't make sense: you can't go higher than the root directory. Normalizing the relative part is therefore difficult without knowing the author's intent. Only you know how to correctly sanitize it. :)
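
If you decide the intent is "stop at the root" (which is how RFC 3986, section 5.2.4, treats extra ".." segments, and as far as I know how urljoin() itself behaves from Python 3.5 on), something along these lines may be safer than a blanket .replace('../', ''). This is only a sketch under that assumption: join_clamped is a made-up helper name, not part of urlparse, and it also collapses empty and "." path segments, which may not be what you want for every site.

```python
try:
    from urlparse import urljoin, urlsplit, urlunsplit  # Python 2
except ImportError:
    from urllib.parse import urljoin, urlsplit, urlunsplit  # Python 3


def join_clamped(base, relative):
    """Join like urljoin(), then discard '..' segments that would climb above the root.

    Hypothetical helper: assumes the page author meant "stay at the root".
    """
    parts = urlsplit(urljoin(base, relative))
    segments = parts.path.split('/')

    kept = []
    for segment in segments:
        if segment == '..':
            if kept:
                kept.pop()  # step up one level; ignore the step if already at root
        elif segment not in ('', '.'):
            kept.append(segment)

    path = '/' + '/'.join(kept)
    # Keep a trailing slash when the original path pointed at a directory.
    if kept and segments[-1] in ('', '.', '..'):
        path += '/'

    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(join_clamped('http://www.example.com/path/', '../../../index.html'))
```

Unlike .replace(), this only touches the path component, so a '../' that happens to appear inside a query string is left alone.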

solution1
0 2013-03-30 03:12:50