urlparse.urljoin() not handling invalid parent directories

Is there a way to account for "invalid" parent directories when constructing an absolute URL from a relative one, or should I just use .replace()?

>>> from urlparse import urljoin
>>> url = urljoin('http://www.example.com/path/', '../../../index.html')
>>> url
'http://www.example.com/../../index.html'
>>> url.replace('../', '')
'http://www.example.com/index.html'

Better yet, is there a cleaner way to sanitize URLs when scraping in Python?

As you said, it doesn't make sense: you can't go higher than the root directory. So normalizing the relative part is difficult without knowing the intent of the author. Only you know how to sanitize it correctly. :)
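
If the intent is simply "never climb above the root", one option is to resolve the dot segments yourself after joining, instead of stripping '../' with .replace(). The following is a minimal sketch under that assumption, not the one correct answer. It uses the Python 3 module urllib.parse (the Python 2 equivalent is urlparse), and the names clamp_dot_segments and safe_urljoin are made up for this example. It also assumes the joined URL has an absolute path, which is the case when the base URL is absolute.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def clamp_dot_segments(path):
    """Resolve '.' and '..' in an absolute path, ignoring any '..' that
    would climb above the root (hypothetical helper for this example)."""
    segments = path.split('/')
    resolved = []
    for segment in segments:
        if segment == '..':
            # resolved[0] is the empty string before the leading '/';
            # never pop it, so excess '..' segments are simply dropped.
            if len(resolved) > 1:
                resolved.pop()
        elif segment != '.':
            resolved.append(segment)
    # A path ending in '.' or '..' refers to a directory: keep the trailing '/'.
    if segments[-1] in ('.', '..'):
        resolved.append('')
    return '/'.join(resolved)


def safe_urljoin(base, relative):
    """Join like urljoin(), then clamp leftover '..' segments in the path."""
    parts = urlsplit(urljoin(base, relative))
    return urlunsplit(parts._replace(path=clamp_dot_segments(parts.path)))


print(safe_urljoin('http://www.example.com/path/', '../../../index.html'))
```

Unlike .replace('../', ''), this only touches whole path segments, so it leaves the query string alone and does not mangle a file name that merely ends in two dots. Note that since Python 3.5, urljoin() follows RFC 3986 and already drops the excess '..' segments, so the extra step mainly matters on older versions or for URLs that were not built with urljoin().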
