Decode escaped characters in URL
I have a list of URLs that contain escaped characters. The escapes were introduced by urllib2.urlopen when it retrieved the HTML page:
http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=edit
http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=history
http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&variant=zh
Is there a way to transform them back to their unescaped form in Python?
PS: The URLs are encoded in UTF-8.
urllib.unquote(string)
Replace %xx escapes by their single-character equivalent.
Example: unquote('/%7Econnolly/') yields '/~connolly/'.
Then decode the resulting byte string as UTF-8, for example urllib.unquote(url).decode('utf8').
Update: For Python 3, write the following:
import urllib.parse
urllib.parse.unquote(url)
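As a minimal sketch of how this applies to the URLs in the question (assuming Python 3; the names urls and decoded are made up for illustration), urllib.parse.unquote interprets the percent escapes as UTF-8 by default, so no separate decode step should be needed:

```python
from urllib.parse import unquote

# Illustrative list, copied from the URLs in the question.
urls = [
    "http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=edit",
    "http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=history",
    "http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&variant=zh",
]

# unquote() treats the %XX bytes as UTF-8 unless another encoding is given.
decoded = [unquote(u) for u in urls]

for u in decoded:
    print(u)
```

If the pages used a different encoding, it could be passed explicitly through the encoding argument of unquote.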
And if you are using Python 3, you could use:
import urllib.parse
urllib.parse.unquote(url)
Or use urllib.unquote_plus, which also replaces plus signs with spaces:
>>> import urllib
>>> urllib.unquote('erythrocyte+membrane+protein+1%2C+PfEMP1+%28VAR%29')
'erythrocyte+membrane+protein+1,+PfEMP1+(VAR)'
>>> urllib.unquote_plus('erythrocyte+membrane+protein+1%2C+PfEMP1+%28VAR%29')
'erythrocyte membrane protein 1, PfEMP1 (VAR)'
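The session above is Python 2. In Python 3 both functions live in urllib.parse; a rough equivalent, assuming the same input string (the variable names are illustrative only), might look like this:

```python
from urllib.parse import unquote, unquote_plus

s = "erythrocyte+membrane+protein+1%2C+PfEMP1+%28VAR%29"

# unquote() decodes %XX escapes but leaves '+' untouched.
plain = unquote(s)

# unquote_plus() additionally turns '+' into a space, as used in form-encoded data.
spaced = unquote_plus(s)

print(plain)
print(spaced)
```

unquote_plus is usually the better fit for query strings, while unquote is the safer choice for the path part of a URL, where a plus sign is a literal character.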
You can use urllib.unquote, or implement an equivalent yourself:
import re

def unquote(url):
    # Replace each %XX escape with the byte it encodes (Python 2 str); decode the result as UTF-8 afterwards.
    return re.sub(r'%([0-9a-fA-F]{2})', lambda m: chr(int(m.group(1), 16)), url)
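In Python 3, chr() returns a Unicode character rather than a byte, so the function above would turn multi-byte UTF-8 escapes such as %E9%A6%96 into mojibake. A hedged sketch of the same idea operating on bytes (the name unquote_utf8 is hypothetical, and UTF-8 input is assumed):

```python
import re

# Matches one percent escape such as %E9; the pattern is bytes because we substitute raw bytes.
_ESCAPE = re.compile(rb"%([0-9a-fA-F]{2})")

def unquote_utf8(url):
    """Replace %XX escapes with raw bytes, then decode the result as UTF-8."""
    raw = url.encode("utf-8")
    unescaped = _ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)
    # Raises UnicodeDecodeError if the escapes do not form valid UTF-8.
    return unescaped.decode("utf-8")

print(unquote_utf8("/%7Econnolly/"))
print(unquote_utf8("title=%E9%A6%96%E9%A1%B5"))
```

This is only meant to show the mechanism; in practice urllib.parse.unquote does the same job and is the more robust choice.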