
Decoding UTF-8 to URL with Python

I have the following URL as a unicode string containing a non-ASCII character (\xa3).

url_input = u'https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-\xa3250pw-all-bills-included-/1174092955'

I need to scrape this web page, and to do so I need the following url_output (the unicode form is not read).

url_output=https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-£250pw-all-bills-included-/1174092955

When I print url_input, I get url_output:

print(url_input)
https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-£250pw-all-bills-included-/1174092955

However, I cannot find a way to transform url_input into url_output. According to forums, the print function uses ASCII decoding on Python 2.7, but ASCII is not supposed to handle \xa3, and url_input.encode('ASCII') does not work.

Does anyone know how I can solve this problem? Thanks in advance!

When you print url_input, you get the desired url_output only because your terminal understands UTF-8 and can display \xa3 correctly.

You can encode the string to ASCII with its encode method, but you have to either replace (with a ?) or ignore the characters that are not ASCII:
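A minimal sketch of this point, written for Python 3 (an assumption, since the question targets Python 2.7): the string holds the single code point U+00A3, and \xa3 is only its escaped spelling. The variable names are illustrative.

```python
# Python 3 sketch: the escape '\xa3' is just one character, the pound sign.
pound = '\xa3'

# Writing the escape or building the character from its code point
# yields the same one-character string.
same_character = (pound == chr(0xA3))

# ascii() returns the escaped representation, comparable to what repr()
# shows for a unicode string in Python 2.
escaped_form = ascii(pound)

# What a terminal displays depends on the bytes sent to it; encoded as
# UTF-8 the character becomes two bytes.
utf8_bytes = pound.encode('utf-8')

print(same_character, escaped_form, utf8_bytes)
```

A terminal that decodes those two bytes as UTF-8 shows the pound sign, which is why print appears to "convert" the string.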

url_output = url_input.encode("ascii", "replace")
print(url_output)

prints:

https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-?250pw-all-bills-included-/1174092955

and

url_output = url_input.encode("ascii", "ignore")
print(url_output)

prints:

https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-250pw-all-bills-included-/1174092955

You cannot obtain an ASCII output string that contains £, because £ is not an ASCII character: its code point (0xA3, i.e. 163) is greater than 127.
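The behaviour described above can be checked with a short sketch. It is written for Python 3 (an assumption; the answer's snippets are Python 2), and it also shows why the plain url_input.encode('ASCII') call from the question fails.

```python
# Python 3 sketch of the ASCII error-handling modes.
url_input = ('https://www.gumtree.com//p/uk-holiday-rentals/'
             '1bedroon-flat-\xa3250pw-all-bills-included-/1174092955')

# The pound sign is code point 163, outside the ASCII range 0..127.
code_point = ord('\xa3')

# The default mode ("strict") refuses to encode it and raises an error.
try:
    url_input.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# "replace" substitutes '?', "ignore" drops the character entirely.
replaced = url_input.encode('ascii', 'replace')
ignored = url_input.encode('ascii', 'ignore')

print(code_point, strict_failed)
print(replaced)
print(ignored)
```

Neither result points at the original page, since both change the path of the URL.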

After some tests, I can confirm that the server accepts the URL in different formats:

  • raw UTF-8 encoded URL:

     url_output = url_input.encode('utf8')

  • %-encoded Latin-1 URL (requires import urllib):

     url_output = urllib.quote_plus(url_input.encode('latin1'), '/:')

  • %-encoded UTF-8 URL (requires import urllib):

     url_output = urllib.quote_plus(url_input.encode('utf8'), '/:')

As raw Latin-1 is not accepted and leads to an incorrect URL error, and as passing non-ASCII characters in a URL may not be safe, my advice is to use the third way. It gives:

    print url_output

    https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-%C2%A3250pw-all-bills-included-/1174092955
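For readers on Python 3, a rough equivalent is sketched below. It assumes Python 3, where quote_plus lives in urllib.parse rather than in urllib itself; the variable names are illustrative, and whether the server still accepts each form is not verified here.

```python
from urllib.parse import quote_plus, unquote

url_input = ('https://www.gumtree.com//p/uk-holiday-rentals/'
             '1bedroon-flat-\xa3250pw-all-bills-included-/1174092955')

# Percent-encode the UTF-8 bytes, leaving '/' and ':' untouched.
utf8_quoted = quote_plus(url_input.encode('utf8'), '/:')

# Same idea, starting from the Latin-1 bytes (a single byte 0xA3).
latin1_quoted = quote_plus(url_input.encode('latin1'), '/:')

# unquote() decodes as UTF-8 by default, so the UTF-8 form round-trips
# back to the original string.
round_trip = unquote(utf8_quoted)

print(utf8_quoted)
print(latin1_quoted)
```

The result is a plain ASCII string, so it can be passed to an HTTP client without further encoding concerns.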
