
JavaScript unescape() vs. Python urllib.unquote()

From reading various posts, it seems like JavaScript's unescape() is equivalent to Python's urllib.unquote(). However, when I test both I get different results:

In the browser console:

unescape('%u003c%u0062%u0072%u003e');

Output: <br>

In the Python interpreter:

import urllib
urllib.unquote('%u003c%u0062%u0072%u003e')

Output: %u003c%u0062%u0072%u003e

I would expect Python to also return <br>. Any ideas as to what I'm missing here?

Thanks!

%uxxxx is a non-standard URL encoding scheme that is not supported by urllib.parse.unquote() (Python 3) or urllib.unquote() (Python 2).

It was only ever part of ECMAScript (ECMA-262, 3rd edition); the format was rejected by the W3C and was never part of an RFC.
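To make the difference concrete, here is a minimal Python 3 sketch comparing a standard %XX sequence with the %uxxxx form. The variable names are only illustrative.

```python
from urllib.parse import unquote

# Standard percent-encoding (%XX) is decoded as expected.
standard = unquote('%3Cbr%3E')

# The non-standard %uXXXX form is not recognised, so the input
# comes back unchanged.
nonstandard = unquote('%u003c%u0062%u0072%u003e')

print(standard)
print(nonstandard)
```

The second call leaves the string untouched because `u0` is not a valid pair of hex digits after the `%`.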

You could use a regular expression to convert such codepoints:

import re

try:
    unichr  # only in Python 2
except NameError:
    unichr = chr  # Python 3

# quoted is the string containing the %uxxxx sequences
re.sub(r'%u([a-fA-F0-9]{4}|[a-fA-F0-9]{2})', lambda m: unichr(int(m.group(1), 16)), quoted)

This decodes both the %uxxxx and the %uxx forms that ECMAScript 3rd edition can decode.

Demo:

>>> import re
>>> quoted = '%u003c%u0062%u0072%u003e'
>>> re.sub(r'%u([a-fA-F0-9]{4}|[a-fA-F0-9]{2})', lambda m: chr(int(m.group(1), 16)), quoted)
'<br>'
>>> altquoted = '%u3c%u0062%u0072%u3e'
>>> re.sub(r'%u([a-fA-F0-9]{4}|[a-fA-F0-9]{2})', lambda m: chr(int(m.group(1), 16)), altquoted)
'<br>'

But you should avoid using this encoding altogether if possible.
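If the data cannot be fixed at the source, the regular expression approach can be wrapped in a small helper. The following is a sketch for Python 3 only, not a complete port of unescape(): the name `js_unescape` is hypothetical, and it assumes you also want plain %XX sequences treated as single code units (which is how unescape() handles them, unlike unquote(), which decodes them as UTF-8 bytes).

```python
import re

# %uXXXX (four hex digits) or %XX (two hex digits).
_ESCAPE_RE = re.compile(r'%u([0-9a-fA-F]{4})|%([0-9a-fA-F]{2})')


def js_unescape(text):
    """Roughly emulate JavaScript's unescape() (hypothetical helper)."""
    def _replace(match):
        # Exactly one of the two groups is set for any match.
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))

    decoded = _ESCAPE_RE.sub(_replace, text)
    # JavaScript strings are UTF-16, so characters outside the BMP arrive
    # as two %uXXXX surrogates. A round trip through UTF-16 merges each
    # pair into a single Python character; lone surrogates are kept.
    return decoded.encode('utf-16', 'surrogatepass').decode('utf-16', 'surrogatepass')


print(js_unescape('%u003c%u0062%u0072%u003e'))
print(js_unescape('%3Cbr%3E'))
```

Anything that does not match either pattern, such as a bare `%`, is passed through unchanged.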
