Can't open Python encoded URL containing Cyrillic symbols

I have the following URL "mysite.com/\u0422\u0435\u043A\u0441\u0442 \u043D\u0430 \u043A\u0438\u0440\u0438\u043B\u0438\u0446\u0430" ("mysite.com/Текст на кирилица"). I want to open this URL using browser.open(link), where browser is:

import cookielib
import urllib2

# Opener that keeps cookies between requests
CHandler = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
browser = urllib2.build_opener(CHandler)
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.17) Gecko/20110420 Firefox/3.6.17'
browser.addheaders = [('User-agent', user_agent)]
urllib2.install_opener(browser)

However, I get the error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 12-17: ordinal not in range(128)

I get this URL from JSON.

How can I resolve this?

mysite.com/Текст на кирилица is not a URL:

  • it omits the http:// (or other) scheme;
  • it contains spaces, which aren't valid;
  • URIs can't contain non-ASCII characters. Only IRIs can, and urllib2 doesn't support them.
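
The three problems above can be checked mechanically. Here is a minimal sketch, written for Python 3 (where the parsing functions live in urllib.parse rather than urlparse); url_problems is a hypothetical helper name, not part of any library, and the scheme check is deliberately naive:

```python
from urllib.parse import urlsplit


def url_problems(address):
    """Hypothetical helper: report which of the three problems an address has."""
    problems = []
    # No "http://" (or other) scheme at the front.
    if not urlsplit(address).scheme:
        problems.append("missing scheme")
    # Literal spaces are not allowed in a URI.
    if " " in address:
        problems.append("contains spaces")
    # Any code point above 127 is non-ASCII, so this is at best an IRI.
    if any(ord(ch) > 127 for ch in address):
        problems.append("contains non-ASCII characters")
    return problems


print(url_problems("mysite.com/Текст на кирилица"))
```

For the address in the question, all three checks fire.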

So you will need to fix these problems: %-encode out-of-band characters (such as space -> %20), add the scheme if it is missing, and then convert the IRI to a URI. To do this conversion, encode the hostname part of the address using the IDN algorithm (Python: s.encode('idna')), then encode any non-ASCII characters in the other parts of the address as UTF-8 and %-encode the resulting bytes.
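
A rough sketch of those steps, written for Python 3 (urllib2's parsing and quoting helpers moved to urllib.parse there). The names iri_to_uri, PATH_SAFE and QUERY_SAFE are made up for this example, the choice of "safe" characters is an assumption, and userinfo (user:pass@) is not handled:

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Assumed sets of characters to leave alone; "%" is kept so that
# escapes already present in the input are not encoded twice.
PATH_SAFE = "/%:@!$&'()*+,;=~"
QUERY_SAFE = PATH_SAFE + "?"


def iri_to_uri(iri, default_scheme="http"):
    """Hypothetical helper: turn a text IRI into an ASCII-only URI."""
    # Step 1: add the scheme if it is missing.
    if "://" not in iri:
        iri = default_scheme + "://" + iri
    parts = urlsplit(iri)

    # Step 2: hostname goes through the IDN codec (ASCII names pass through).
    netloc = (parts.hostname or "").encode("idna").decode("ascii")
    if parts.port is not None:
        netloc += ":" + str(parts.port)

    # Step 3: other parts are UTF-8 encoded, then %-encoded (space -> %20).
    path = quote(parts.path, safe=PATH_SAFE, encoding="utf-8")
    query = quote(parts.query, safe=QUERY_SAFE, encoding="utf-8")
    fragment = quote(parts.fragment, safe=QUERY_SAFE, encoding="utf-8")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("mysite.com/Текст на кирилица"))
```

The returned string is plain ASCII, so it can be passed to browser.open() without triggering the 'ascii' codec error.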

What you want to end up with is:

http://mysite.com/%D0%A2%D0%B5%D0%BA%D1%81%D1%82%20%D0%BD%D0%B0%20%D0%BA%D0%B8%D1%80%D0%B8%D0%BB%D0%B8%D1%86%D0%B0

which is a valid URI accepted by urllib2, but also displays as http://mysite.com/Текст на кирилица in the browser's address bar when you follow it.
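
To see that the encoded URI and the readable form are the same address, you can reverse the %-encoding. A small Python 3 sketch (under Python 2 the equivalent function is urllib.unquote, which returns bytes that you would then decode as UTF-8):

```python
from urllib.parse import unquote

uri = ("http://mysite.com/%D0%A2%D0%B5%D0%BA%D1%81%D1%82%20%D0%BD%D0%B0%20"
       "%D0%BA%D0%B8%D1%80%D0%B8%D0%BB%D0%B8%D1%86%D0%B0")

# The URI itself is pure ASCII, which is what the HTTP layer needs.
uri.encode("ascii")

# Undo the %-encoding and interpret the bytes as UTF-8.
readable = unquote(uri, encoding="utf-8")
print(readable)  # http://mysite.com/Текст на кирилица
```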

There are lots of functions about that implement IRI-to-URI conversion (most Python web frameworks have something like it, for one). If you want to go the whole hog on correcting and normalising suspect incoming URLs, there's also this.
