Can't open Python encoded URL containing Cyrillic symbols
I have the following URL: "mysite.com/\u0422\u0435\u043A\u0441\u0442 \u043D\u0430 \u043A\u0438\u0440\u0438\u043B\u0438\u0446\u0430" ("mysite.com/Текст на кирилица"). I want to open this URL using browser.open(link), where browser is:
import cookielib
import urllib2

CHandler = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
browser = urllib2.build_opener(CHandler)
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.17) Gecko/20110420 Firefox/3.6.17'
browser.addheaders = [('User-agent', user_agent)]
urllib2.install_opener(browser)
However, I get the error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 12-17: ordinal not in range(128)
I get this URL from JSON.
How can I resolve this?
mysite.com/Текст на кирилица is not a URL: it is missing the http:// (or other) schema and contains out-of-band characters (a space and Cyrillic letters). At best it is an IRI, and urllib2 doesn't support them. So you will need to fix the brokenness: %-encode out-of-band characters (like space -> %20), add the schema if it is missing, and then convert the IRI to a URI. To do this conversion, encode the hostname part of the address using the IDN algorithm (Python: s.encode('idna')), then encode any non-ASCII characters in the other parts of the address using UTF-8 followed by %-encoding.
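The conversion steps described above can be sketched roughly as follows. This is only an illustration, not a complete IRI implementation: iri_to_uri and its default_scheme parameter are hypothetical names, the imports are written to work on both Python 2 and Python 3, and any user:password part of the address is dropped for brevity.

```python
try:
    from urllib.parse import quote, urlsplit, urlunsplit  # Python 3
except ImportError:
    from urllib import quote  # Python 2
    from urlparse import urlsplit, urlunsplit


def iri_to_uri(iri, default_scheme=u"http"):
    """Hypothetical helper: turn a text IRI into an ASCII-only URI."""
    # Add the scheme if it is missing, so that urlsplit can find the host.
    if u"://" not in iri:
        iri = default_scheme + u"://" + iri
    parts = urlsplit(iri)

    # Hostname: IDNA encoding (leaves plain ASCII hosts unchanged).
    netloc = parts.hostname.encode("idna").decode("ascii")
    if parts.port is not None:
        netloc += u":%d" % parts.port

    def pct(text, safe):
        # UTF-8 encode, then %-encode every byte outside the safe set.
        # "%" is kept safe so existing %XX escapes are not encoded twice.
        return quote(text.encode("utf-8"), safe=safe)

    return urlunsplit((
        parts.scheme,
        netloc,
        pct(parts.path, "/%"),
        pct(parts.query, "=&%/?"),
        pct(parts.fragment, "%/?"),
    ))


link = u"mysite.com/\u0422\u0435\u043a\u0441\u0442 \u043d\u0430 \u043a\u0438\u0440\u0438\u043b\u0438\u0446\u0430"
print(iri_to_uri(link))
```

The resulting string contains only ASCII characters, so it can be passed to browser.open() without triggering the UnicodeEncodeError. On Python 2 the input must be a unicode object, which is what the json module returns for decoded strings.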
What you want to end up with is:
http://mysite.com/%D0%A2%D0%B5%D0%BA%D1%81%D1%82%20%D0%BD%D0%B0%20%D0%BA%D0%B8%D1%80%D0%B8%D0%BB%D0%B8%D1%86%D0%B0
which is a valid URI accepted by urllib2, but also displays as http://mysite.com/Текст на кирилица in the browser's address bar when you follow it.
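To see why the browser can show the readable form, here is a small check that %-decoding the URI and interpreting the bytes as UTF-8 gives back the original Cyrillic text. It is a sketch only; the isinstance branch is there because, on Python 2, unquote returns a byte string.

```python
try:
    from urllib.parse import unquote  # Python 3
except ImportError:
    from urllib import unquote  # Python 2

uri = ("http://mysite.com/%D0%A2%D0%B5%D0%BA%D1%81%D1%82%20%D0%BD%D0%B0%20"
       "%D0%BA%D0%B8%D1%80%D0%B8%D0%BB%D0%B8%D1%86%D0%B0")

decoded = unquote(uri)
if isinstance(decoded, bytes):
    # Python 2: unquote yields raw bytes, so decode them as UTF-8.
    decoded = decoded.decode("utf-8")

expected = u"http://mysite.com/\u0422\u0435\u043a\u0441\u0442 \u043d\u0430 \u043a\u0438\u0440\u0438\u043b\u0438\u0446\u0430"
print(decoded == expected)
```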
There are lots of functions about that implement IRI-to-URI conversion (most Python web frameworks have something like it, for one). If you want to go the whole hog on correcting and normalising suspect incoming URLs, there's also this.