Decoding UTF-8 to URL with Python

I have the following URL encoded in UTF-8.

url_input = u'https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-\xa3250pw-all-bills-included-/1174092955'

I need to scrape this web page, and to do so I need the following url_output (the unicode version is not read).

url_output=https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-£250pw-all-bills-included-/1174092955

When I print url_input, I get url_output:

print(url_input)
https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-£250pw-all-bills-included-/1174092955

However, I cannot find a way to transform url_input into url_output. According to forum posts, the print function uses ASCII decoding on Python 2.7, but ASCII is not supposed to be able to read \xa3, and url_input.encode('ASCII') does not work.

Does anyone know how I can solve this problem? Thanks in advance!

When you print url_input, you get the desired url_output only because your terminal understands UTF-8 and can represent \xa3 correctly.
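To illustrate this point, here is a minimal sketch. It assumes Python 3 and a UTF-8 source file (the question uses Python 2.7, where the literal needs the `u` prefix shown above). The variable names `same_string` and `pound_code_point` are only illustrative. It suggests that `\xa3` is just an escape for the code point U+00A3, so the string already contains £ and the difference is only in how it is displayed:

```python
# Python 3 sketch (assumption): "\xa3" is an escape sequence for U+00A3, the pound sign.
url_input = 'https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-\xa3250pw-all-bills-included-/1174092955'
url_output = 'https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-£250pw-all-bills-included-/1174092955'

# Both literals describe the same sequence of characters.
same_string = (url_input == url_output)

# The escape \xa3 and the character £ are the same code point.
pound_code_point = ord('\xa3')

print(same_string)
print(pound_code_point)
```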

You can encode the string in ASCII with str.encode, but you have to either replace (with a ?) or ignore the characters that are not ASCII:

url_output = url_input.encode("ascii", "replace")
print(url_output)

will print:

https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-?250pw-all-bills-included-/1174092955

and

url_output = url_input.encode("ascii", "ignore")
print(url_output)

will print:

https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-250pw-all-bills-included-/1174092955

You cannot obtain an ASCII output string that contains £, because the code point of this character (163, or 0xA3) is greater than 127, the highest value ASCII can represent.
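The behaviour described above can be checked with a short sketch. It uses only the `u` and `b` literal prefixes and the built-in `encode` method, so it should behave the same way on Python 2.7 and Python 3. The names `strict_failed`, `replaced`, and `ignored` are illustrative, not from the original answer:

```python
url_input = u'https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-\xa3250pw-all-bills-included-/1174092955'

# The pound sign sits outside the ASCII range (0..127).
code_point = ord(u'\xa3')

# A strict ASCII encode fails, which is why url_input.encode('ASCII') "does not work".
try:
    url_input.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# The two lossy alternatives: substitute "?" or drop the character entirely.
replaced = url_input.encode('ascii', 'replace')
ignored = url_input.encode('ascii', 'ignore')

print(code_point)
print(strict_failed)
```

Both lossy variants change the URL, so neither is likely to point at the original page.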

After some tests, I can confirm that the server accepts the URL in different formats:

  • raw UTF-8 encoded URL:

     url_output = url_input.encode('utf8')
  • percent-encoded Latin-1 URL (requires import urllib on Python 2):

     url_output = urllib.quote_plus(url_input.encode('latin1'), '/:')
  • percent-encoded UTF-8 URL (requires import urllib on Python 2):

     url_output = urllib.quote_plus(url_input.encode('utf8'), '/:')

As raw Latin-1 is not accepted and leads to an incorrect URL error, and as passing non-ASCII characters in a URL may not be safe, my advice is to use the third way. It gives:

    print(url_output)

    https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-%C2%A3250pw-all-bills-included-/1174092955
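For readers on Python 3, where `quote_plus` lives in `urllib.parse` and accepts text plus an `encoding` argument, the two percent-encoded variants above might look like the following sketch. The Python 3 port and the variable names (`utf8_quoted`, `latin1_quoted`, `round_trip`) are assumptions, not part of the original answer:

```python
from urllib.parse import quote_plus, unquote

url_input = 'https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-\xa3250pw-all-bills-included-/1174092955'

# safe='/:' keeps the path separators and the scheme colon unescaped,
# mirroring the '/:' argument used in the Python 2 calls above.
utf8_quoted = quote_plus(url_input, safe='/:', encoding='utf-8')
latin1_quoted = quote_plus(url_input, safe='/:', encoding='latin-1')

# Decoding the UTF-8 variant (unquote defaults to UTF-8) recovers the original text.
round_trip = unquote(utf8_quoted)

print(utf8_quoted)
print(latin1_quoted)
```

The UTF-8 variant encodes £ as the two bytes `%C2%A3`, while the Latin-1 variant encodes it as the single byte `%A3`.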
