Decoding UTF-8 URL in Python

I have a string like "pe%20to%C5%A3i%20mai". When I apply urllib.parse.unquote to it, I get "pe to\u0163i mai". If I try to write this to a file, I get those exact symbols, not the expected glyph.

How can I transform the string to UTF-8 so that the file contains the proper glyph instead?

Edit: I'm using Python 3.2.

Edit2: So I figured out that urllib.parse.unquote was working correctly, and my problem is actually that I'm serializing to YAML with yaml.dump, which seems to screw things up. Why?

Update: If the output file is a YAML document, then you can ignore \u0163 in it. Unicode escapes are valid in YAML documents.

#!/usr/bin/env python3
import json

# json produces a subset of yaml
print(json.dumps('pe toţi mai')) # -> "pe to\u0163i mai"
print(json.dumps('pe toţi mai', ensure_ascii=False)) # -> "pe toţi mai"

Note: there is no \u in the last case. Both lines represent the same Python string.
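To check that both outputs really stand for the same string, here is a minimal sketch using only the standard json module. It parses each serialized form back and compares the results; the variable names are illustrative.

```python
import json

text = 'pe to\u0163i mai'  # the same string as 'pe toţi mai'

escaped = json.dumps(text)                       # ASCII-only output containing a \u0163 escape
readable = json.dumps(text, ensure_ascii=False)  # output keeps the literal character

# The serialized forms differ...
print(escaped != readable)  # True
# ...but both parse back to the identical Python string.
print(json.loads(escaped) == json.loads(readable) == text)  # True
```

Which form to write is mostly a question of who reads the file: the escaped form survives any ASCII-only channel, while the unescaped form is easier for humans to read.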

yaml.dump() has a similar option: allow_unicode. Set it to True to avoid Unicode escapes.
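A short sketch of that option, assuming the third-party PyYAML package is installed (the dictionary key 'text' is made up for illustration). It avoids printing the non-ASCII character itself so that it also runs on consoles with a limited encoding.

```python
import yaml  # third-party package: PyYAML

data = {'text': 'pe to\u0163i mai'}  # 'pe toţi mai'

escaped = yaml.dump(data)                       # default: non-ASCII characters become \u escapes
readable = yaml.dump(data, allow_unicode=True)  # the character is emitted as-is

print('\\u0163' in escaped)  # True: the escape sequence is in the default output
print('\u0163' in readable)  # True: the real character is in this output

# Either document loads back to the same data.
print(yaml.safe_load(escaped) == yaml.safe_load(readable) == data)  # True
```

When writing the unescaped form to disk, the file should be opened with encoding='utf-8', as in the file examples elsewhere on this page.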


The URL is correct. You don't need to do anything with it:

#!/usr/bin/env python3
from urllib.parse import unquote

url =  "pe%20to%C5%A3i%20mai"
text = unquote(url)

with open('some_file', 'w', encoding='utf-8') as file:
    def p(line):
        print(line, file=file) # write line to file

    p(text)                # -> pe toţi mai
    p(repr(text))          # -> 'pe toţi mai'
    p(ascii(text))         # -> 'pe to\u0163i mai'

    p("pe to\u0163i mai")  # -> pe toţi mai
    p(r"pe to\u0163i mai") # -> pe to\u0163i mai
    #NOTE: r'' prefix

The \u0163 sequence might be introduced by a character encoding error handler:

with open('some_other_file', 'wb') as file: # write bytes
    file.write(text.encode('ascii', 'backslashreplace')) # -> pe to\u0163i mai

Or:

with open('another', 'w', encoding='ascii', errors='backslashreplace') as file:
    file.write(text) # -> pe to\u0163i mai
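For comparison, a small sketch of how several standard error handlers treat the same character when encoding to ASCII. Only built-in str.encode behaviour is used; the choice of handlers shown is mine.

```python
text = 'pe to\u0163i mai'

# 'strict' (the default) refuses to encode a non-ASCII character.
try:
    text.encode('ascii')
except UnicodeEncodeError as exc:
    print('strict failed:', exc.reason)

print(text.encode('ascii', 'backslashreplace'))   # b'pe to\\u0163i mai'
print(text.encode('ascii', 'xmlcharrefreplace'))  # b'pe to&#355;i mai'
print(text.encode('ascii', 'replace'))            # b'pe to?i mai'

# Encoding to UTF-8 needs no error handler: the character becomes two bytes.
print(text.encode('utf-8'))                       # b'pe to\xc5\xa3i mai'
```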

More examples:

# introduce some more \u escapes
b = r"pe to\u0163i mai ţţţ".encode('ascii', 'backslashreplace') # bytes
print(b.decode('ascii')) # -> pe to\u0163i mai \u0163\u0163\u0163
# remove unicode escapes
print(b.decode('unicode-escape')) # -> pe toţi mai ţţţ

Python 3

Calling urllib.parse.unquote already returns a Unicode string:

>>> import urllib.parse
>>> urllib.parse.unquote("pe%20to%C5%A3i%20mai")
'pe toţi mai'

If you don't get that result, there must be an error in your code. Please post your code.
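As a rough illustration of what unquote does internally, the sketch below splits it into two steps with urllib.parse.unquote_to_bytes and then shows the effect of its encoding parameter. The latin-1 case is a deliberately wrong choice, included only to show what a mismatched encoding looks like.

```python
from urllib.parse import unquote, unquote_to_bytes

url = "pe%20to%C5%A3i%20mai"

# Step 1: percent-decoding yields raw bytes.
raw = unquote_to_bytes(url)
print(raw)  # b'pe to\xc5\xa3i mai'

# Step 2: those bytes are decoded as UTF-8, which is unquote's default encoding.
print(raw.decode('utf-8') == unquote(url) == 'pe to\u0163i mai')  # True

# Decoding with the wrong encoding gives two wrong characters instead of one glyph.
print(ascii(unquote(url, encoding='latin-1')))  # 'pe to\xc5\xa3i mai'
```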

Python 2

Use decode to get a Unicode string from a bytestring:

>>> import urllib2
>>> print urllib2.unquote("pe%20to%C5%A3i%20mai").decode('utf-8')
pe toţi mai

Remember that when you write a Unicode string to a file, you have to encode it again. You could choose to write the file as UTF-8, but you could also choose a different encoding if you wished. You also have to remember to use the same encoding when reading back from the file. You may find the codecs module useful for specifying an encoding when reading from and writing to files.

>>> import urllib2, codecs
>>> s = urllib2.unquote("pe%20to%C5%A3i%20mai").decode('utf-8')

>>> # Write the string to a file.
>>> with codecs.open('test.txt', 'w', 'utf-8') as f:
...     f.write(s)

>>> # Read the string back from the file.
>>> with codecs.open('test.txt', 'r', 'utf-8') as f:
...     s2 = f.read()
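In Python 3, the built-in open() accepts an encoding argument, so the same round trip might look like the sketch below. The temporary directory is my addition so that the example cleans up after itself; the file name follows the example above.

```python
import os
import tempfile
from urllib.parse import unquote

s = unquote("pe%20to%C5%A3i%20mai")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.txt')

    # Write the string, encoding it as UTF-8.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(s)

    # On disk the character is stored as the two bytes C5 A3.
    with open(path, 'rb') as f:
        raw = f.read()

    # Read it back with the same encoding.
    with open(path, 'r', encoding='utf-8') as f:
        s2 = f.read()

print(raw)      # b'pe to\xc5\xa3i mai'
print(s2 == s)  # True
```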

One potentially confusing issue is that in the interactive interpreter, Unicode strings are sometimes displayed using the \uxxxx notation instead of the actual characters:

>>> s
u'pe to\u0163i mai'
>>> print s
pe toţi mai

This does not mean that the string is "wrong". It's just the way the interpreter works.

Try decoding using unicode_escape.

E.g.:

>>> print "pe to\u0163i mai".decode('unicode_escape')
pe toţi mai

urllib.parse.unquote returned a correct Unicode string, and writing it straight to the file produced the expected result. The problem was with yaml: by default, yaml.dump writes non-ASCII characters as escape sequences rather than as UTF-8 text.

My solution was to do:

yaml.dump("pe%20to%C5%A3i%20mai",encoding="utf-8").decode("unicode-escape")

Thanks to JF Sebastian and Mark Byers for asking me the right questions that helped me figure out the problem!
