
How to normalize Python string encodings

I have a text file with strings. These strings ultimately represent URL paths (not full URLs), but have been encoded in several ways. Here is an excerpt of the file:

25_%D1%80%D0%B0%D1%88%D3%99%D0%B0%D1%80%D0%B0
2_\xD1\x80\xD0\xB0\xD1\x88\xD3\x99\xD0\xB0\xD1\x80\xD0\xB0
5_%D1%80%D0%B0%D1%88%D3%99%D0%B0%D1%80%D0%B0
\xD0\x90\xD0\xBA\xD0\xB0\xD0\xB1\xD0\xB0
\xD0\x90\xD1\x88\xD3\x99\xD0\xB0\xD1\x85\xD1\x8C\xD0\xB0
function.fopen
Бразилиа
Валерии_Маиромиан
Rome,_Italy
Rome%2C_Italy

I would like to guarantee a common format for all these strings, because after loading the file I will need to do string comparisons (e.g. Rome%2C_Italy should equal Rome,_Italy).

Some lines are URL-encoded, and can be easily unquoted:

import urllib
with open("input.txt") as f:
    for line in f:
        str = urllib.unquote(line.rstrip())
        print str

The output of the previous code is:

25_рашәара
2_\xD1\x80\xD0\xB0\xD1\x88\xD3\x99\xD0\xB0\xD1\x80\xD0\xB0
5_рашәара
\xD0\x90\xD0\xBA\xD0\xB0\xD0\xB1\xD0\xB0
\xD0\x90\xD1\x88\xD3\x99\xD0\xB0\xD1\x85\xD1\x8C\xD0\xB0
function.fopen
Бразилиа
Валерии_Маиромиан
Rome,_Italy
Rome,_Italy

My best attempt is the following code:

import urllib
with open("input.txt") as f:
    for line in f:
        str = urllib.unquote(line.rstrip()).encode("utf8")
        print str

with the following output:

25_рашәара
2_\xD1\x80\xD0\xB0\xD1\x88\xD3\x99\xD0\xB0\xD1\x80\xD0\xB0
5_рашәара
\xD0\x90\xD0\xBA\xD0\xB0\xD0\xB1\xD0\xB0
\xD0\x90\xD1\x88\xD3\x99\xD0\xB0\xD1\x85\xD1\x8C\xD0\xB0
function.fopen
Бразилиа
Валерии_Маиромиан
Rome,_Italy
Rome,_Italy

It seems to have ignored some lines!

In any case, I believe it would be preferable to simply URL-encode all these strings (as with line 1), but the urllib.quote() method will not work well on the lines that are already URL-encoded (it will encode % again!).
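
The double-encoding problem can be reproduced in a few lines. This is a minimal sketch written for Python 3, where these functions live in urllib.parse rather than urllib; the variable names are made up for the illustration. It also suggests that unquoting before quoting avoids the problem for strings like these.

```python
from urllib.parse import quote, unquote

already_encoded = "Rome%2C_Italy"
plain = "Rome,_Italy"

# Quoting an already-encoded string escapes the "%" itself.
double_encoded = quote(already_encoded)
print(double_encoded)  # Rome%252C_Italy

# Unquoting first gives both spellings the same URL-encoded form.
canonical_a = quote(unquote(already_encoded))
canonical_b = quote(unquote(plain))
print(canonical_a == canonical_b)  # True
```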

Any help clearing up my confusion is appreciated!

This code uses a similar approach to Eugene Lisitsky's, except that it runs on Python 2. There may be a neater way to do this in Python 2, but it appears to work correctly on the data in the OP.

BTW, you should tag your question with an appropriate Python version tag when you ask a question relating to Unicode, since Unicode handling in Python 3 is quite different from how it works (or fails to work :) ) in Python 2.

import codecs
import urllib

fname = 'input.txt'

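with open(fname, 'rb') as f:
    for line in f:
        line = line.strip()
        # Undo URL encoding: %D1%80 becomes the bytes \xd1\x80
        line = urllib.unquote(line)
        if r'\x' in line:
            # Interpret the literal \xNN escapes, then map the resulting
            # code points back to bytes of the same value
            line = codecs.unicode_escape_decode(line)[0]
            line = line.encode('latin1')

        # At this point every line holds UTF-8 bytes; decode to unicode
        line = line.decode('utf-8')
        print repr(line), line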

Output:

u'25_\u0440\u0430\u0448\u04d9\u0430\u0440\u0430' 25_рашәара
u'2_\u0440\u0430\u0448\u04d9\u0430\u0440\u0430' 2_рашәара
u'5_\u0440\u0430\u0448\u04d9\u0430\u0440\u0430' 5_рашәара
u'\u0410\u043a\u0430\u0431\u0430' Акаба
u'\u0410\u0448\u04d9\u0430\u0445\u044c\u0430' Ашәахьа
u'function.fopen' function.fopen
u'\u0411\u0440\u0430\u0437\u0438\u043b\u0438\u0430' Бразилиа
u'\u0412\u0430\u043b\u0435\u0440\u0438\u0438_\u041c\u0430\u0438\u0440\u043e\u043c\u0438\u0430\u043d' Валерии_Маиромиан
u'Rome,_Italy' Rome,_Italy
u'Rome,_Italy' Rome,_Italy

As you can see, I've converted all the strings to Unicode objects. If for some reason you want them as plain Python 2 strings, just remove the line = line.decode('utf-8') line.
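
For readers on Python 3, the same steps can be sketched as a small function. This is an adaptation under assumptions, not part of the answer above: the name normalize is hypothetical, the file is assumed to be read in binary mode, and any line containing \x is assumed to hold only valid backslash escapes (other escapes such as \n would be interpreted too).

```python
import codecs
from urllib.parse import unquote_to_bytes


def normalize(raw: bytes) -> str:
    """Return a str for one raw line of the file (hypothetical helper)."""
    # Undo URL encoding while staying in bytes: %D1%80 -> b'\xd1\x80'
    data = unquote_to_bytes(raw.strip())
    if rb"\x" in data:
        # Literal "\xD1" text becomes code point U+00D1; Latin-1 then
        # maps each code point below 256 back to the byte of that value.
        data = codecs.unicode_escape_decode(data)[0].encode("latin1")
    # Every variant now holds UTF-8 bytes.
    return data.decode("utf-8")


# The two spellings from the question compare equal after normalizing.
print(normalize(b"Rome%2C_Italy\n") == normalize(b"Rome,_Italy\n"))  # True
```

With a file, this could be used as `names = [normalize(line) for line in open("input.txt", "rb")]`, after which ordinary string comparison applies.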

You may use codecs.unicode_escape_decode to decode backslash-escaped characters like so:

>>> import codecs
>>> s=r"\xD0\x90\xD0\xBA\xD0\xB0\xD0\xB1\xD0\xB0"
>>> print(s)
\xD0\x90\xD0\xBA\xD0\xB0\xD0\xB1\xD0\xB0
>>> s1=codecs.unicode_escape_decode(s)[0]
>>> print(s1)
ÐÐºÐ°Ð±Ð°
>>> bytes(s1,'latin1').decode('utf-8')
'Акаба'
>>>
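
The latin1 step works because Latin-1 maps each code point from 0 to 255 to the byte with the same value, so encoding recovers the raw UTF-8 bytes that the escapes spelled out. A short check of each intermediate value, assuming Python 3 and using illustrative variable names:

```python
import codecs

escaped = r"\xD0\x90\xD0\xBA\xD0\xB0\xD0\xB1\xD0\xB0"

# Each \xNN escape becomes the code point U+00NN, not a byte.
as_code_points = codecs.unicode_escape_decode(escaped)[0]
first_two = [ord(c) for c in as_code_points[:2]]  # [0xD0, 0x90]

# Latin-1 encodes code point N as byte N, giving back the UTF-8 bytes.
raw_bytes = as_code_points.encode("latin1")

# Decoding those bytes as UTF-8 yields the intended Cyrillic text.
decoded = raw_bytes.decode("utf-8")
print(decoded)
```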
