De-mojibaking with Python and mutagen

I'm reading mojibaked ID3 tags with mutagen. My goal is to fix the mojibake while learning about encodings and Python's handling thereof.

The file I'm working with has an ID3v2 tag, and I'm looking at its album (TALB) frame. According to the encoding byte in that frame, the text is encoded in Latin-1 (ISO-8859-1). I know, however, that the bytes in this frame are actually encoded in cp1251 (Cyrillic).

Here's my code so far:

 >>> from mutagen.mp3 import MP3
 >>> mp3 = MP3(paths[0])
 >>> mp3['TALB']
 TALB(encoding=0, text=[u'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'])

Now, as you can see, mp3['TALB'].text[0] is represented here as a Unicode string. However, it's mojibaked:

 >>> print mp3['TALB'].text[0]
 Áóðæóéñêèå ïëÿñêè

I am having very little luck transcoding these cp1251 bytes into their correct Unicode code points. My best result so far is rather unbecoming:

>>> st = ''.join([chr(ord(x)) for x in mp3['TALB'].text[0]]); st
'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'
>>> print st.decode('cp1251')
Буржуйские пляски <-- **this is the correct, demojibaked text!**

As I understand this approach, it works because I end up transforming the Unicode string into an 8-bit string, which I can then decode into Unicode while specifying the encoding I am decoding from.
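
That reading can be checked with a small sketch. It is written for Python 3 (the session above is Python 2), and the variable names are only illustrative. It assumes nothing beyond the tag text shown earlier: every character in it has a code point below 256, so each one maps to exactly one byte, which is the mapping Latin-1 defines.

```python
# Python 3 sketch; the question's own session is Python 2.
mojibake = "\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8"

# The manual approach: turn each code point (all below 256 here) into one byte.
by_hand = bytes(ord(ch) for ch in mojibake)

# Latin-1 maps code points 0-255 to the same byte values, so the codec agrees.
via_codec = mojibake.encode("latin-1")
assert by_hand == via_codec

# Those bytes are the original cp1251 data, so decoding them gives the real text.
recovered = by_hand.decode("cp1251")
print(recovered)
```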

The problem is that I can't call decode('cp1251') on the Unicode string directly:

>>> st = mp3['TALB'].text[0]; st
u'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'
>>> print st.decode('cp1251')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Users/dmitry/dev/mp3_tag_encode_convert/lib/python2.7/encodings/cp1251.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)

Can someone explain this? I don't understand how to keep it from going through the 7-bit ASCII codec when I operate directly on the u'' string.

First, encode it with the encoding it was decoded with (Latin-1, per the frame's encoding byte) to get the original bytes back.

>>> tag = u'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'
>>> raw = tag.encode('latin-1'); raw
'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'
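
As for why this explicit encode is needed: decoding is an operation on bytes, so when Python 2 is asked to decode a unicode object it first has to turn it into bytes, and it does that implicitly with the default ASCII codec. That hidden step is what raises the UnicodeEncodeError in the traceback. A rough Python 3 sketch of the same failure, assuming only the tag text from the question:

```python
# Python 3 sketch of the implicit step Python 2 performed behind the scenes.
mojibake = "\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8"

# Python 3 makes the mismatch explicit: text objects have no decode() method.
assert not hasattr(mojibake, "decode")

# Python 2 effectively ran this ASCII encode before the cp1251 decode.
error = None
try:
    mojibake.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc
    # The first ten characters are outside the ASCII range.
    print(exc.encoding, exc.start, exc.end)
```

Encoding with Latin-1 instead of ASCII succeeds because Latin-1 covers all 256 byte values.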

Then you can decode it with the proper encoding.

>>> fixed = raw.decode('cp1251'); print fixed
Буржуйские пляски
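
If there are many tags to repair, the two steps can be wrapped in a small function. The following is only a sketch in Python 3. The name fix_mojibake and its parameters are made up for illustration and are not part of mutagen. It assumes you already know both the declared and the actual encoding.

```python
def fix_mojibake(text, declared="latin-1", actual="cp1251"):
    """Re-decode text that was decoded with the wrong codec (hypothetical helper)."""
    try:
        # Undo the wrong decode to recover the raw bytes, then decode them properly.
        return text.encode(declared).decode(actual)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not round-trip, so leave it untouched.
        return text


fixed = fix_mojibake("\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8")
print(fixed)

# Text that is already correct Cyrillic cannot be encoded as Latin-1,
# so it passes through unchanged.
unchanged = fix_mojibake(fixed)
```

Note that the guard only catches text that fails to round-trip. A tag that really is Latin-1 (say, an album title with accented Western European letters) would still be converted, so it is probably best to apply this only to files you know are affected.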
