
Mutagen 1.22 Encoding Issue

I am having an issue with character encoding with Mutagen.

I cast dict[key] to unicode, but all I get are errors. The character in question is U+00E9 (é), but what gets printed is ├⌐. I am assuming the default character set for Mutagen is UTF-8, but is there a way to fix this?

Output:

Winter Wonderland.mp3
Album       : Christmas
Album Artist: Michael Bublé
Artist      : Michael Bublé
Composer    : None
Disk        : None
Encoded By  : None
Genre       : Christmas
Title       : Winter Wonderland
Track       : 17/19
Year        : 2011

Code:

#!/usr/bin/env python

import os
import re
from mutagen.mp3 import MP3

first_cap_re = re.compile('(.)([A-Z][a-z]+)')
all_cap_re = re.compile('([a-z0-9])([A-Z])')
def convertCamelCase2Underscore(name):
    s1 = first_cap_re.sub(r'\1_\2', name)
    return all_cap_re.sub(r'\1_\2', s1).lower()

def convertCamelCase2CapitalizedWords(name):
    return ' '.join([x.capitalize() for x in convertCamelCase2Underscore(name).split('_')])

def safeValue(dict, key):
    return None if key not in dict else dict[key]

class Track:
    def __init__(self, path):
        audio = MP3(path)
        self.title = safeValue(audio, 'TIT2')
        self.artist = safeValue(audio, 'TPE1')
        self.albumArtist = safeValue(audio, 'TPE2')
        self.album = safeValue(audio, 'TALB')
        self.genre = safeValue(audio, 'TCON')
        self.year = safeValue(audio, 'TDRL')
        self.encodedBy = safeValue(audio, 'TENC')
        self.composer = safeValue(audio, 'TXXX:TCM')
        self.track = safeValue(audio, 'TRCK')
        self.disk = safeValue(audio, 'TXXX:TPA')
    def __repr__(self):
        ret = ''
        fields = self.__dict__

        for k, v in sorted(self.__dict__.iteritems()):
            ret += '{:12s}: {:s}\n'.format(convertCamelCase2CapitalizedWords(k), v)
        return ret

files = os.listdir('.')

for filename in files:
    print filename
    print Track(filename)

"I am assuming the default character set for Mutagen is UTF-8"

Mutagen returns Unicode strings, though wrapped in a TextFrame object. When you print that object, Python implicitly calls str() on it, which converts the text property to bytes, and Mutagen (arbitrarily) chooses UTF-8 for that encoding.

Unfortunately the Windows console doesn't support UTF-8[1]. The encoding it uses varies, but in your case you are getting the US DOS code page 437, where the byte sequence 0xC3 0xA9 represents ├⌐ and not é. You could try to print to the console in the encoding that it wants by explicitly encoding to it:

import sys
print unicode(audio['TIT2']).encode(sys.stdout.encoding)  # 'cp437'

but this will still only allow you to print characters that are supported in that code page. 437 is OK for Michael Bublé, but not so good for 東京事変. There isn't a good way to get Unicode out to the Windows console.[2]
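The byte-level behaviour described above can be checked without Mutagen or a Windows console. The following is a minimal sketch using only Python's built-in codecs; the helper name encode_for_console is made up for illustration, and the 'replace' error handler is one possible fallback, not something the code above does. It is written to run unchanged on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Reproduce the garbling and the code page limits with plain codecs.

text = u'Michael Bubl\u00e9'  # ends in U+00E9 (e with acute accent)

# The UTF-8 bytes that end up on stdout: U+00E9 becomes 0xC3 0xA9.
utf8_bytes = text.encode('utf-8')

# A console using code page 437 interprets those two bytes as two
# box-drawing/symbol characters: U+251C and U+2310.
mojibake = utf8_bytes.decode('cp437')  # u'Michael Bubl\u251c\u2310'

# Encoding to the console's own code page works as long as the character
# exists there; in cp437, U+00E9 is the single byte 0x82.
cp437_bytes = text.encode('cp437')  # b'Michael Bubl\x82'


def encode_for_console(value, encoding):
    """Hypothetical helper: encode, replacing unsupported characters with '?'."""
    return value.encode(encoding, 'replace')


japanese = u'\u6771\u4eac\u4e8b\u5909'  # the Japanese band name from the text

# A strict encode fails because cp437 has none of these characters.
try:
    japanese.encode('cp437')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With 'replace', each unsupported character degrades to '?'.
fallback = encode_for_console(japanese, 'cp437')  # b'????'
```

Passing 'replace' avoids a UnicodeEncodeError at print time, at the cost of losing the unsupported characters entirely.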

[1] There is code page 65001 which is supposed to be UTF-8, but there are bugs in the MS implementation which usually make it unusable.

[2] You can, if you must, call the Win32 API WriteConsoleW directly using ctypes, but then you have to take care to only do that when you are connected to a Windows console and not any other type of stream so you don't break everywhere else. It's usually not worth it; Windows users are assumed to be used to a console where non-ASCII characters just break all the time.
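A rough sketch of what footnote [2] describes, under several assumptions: the function name write_console is made up, GetConsoleMode is used as the "is this really a console?" check (it fails for files and pipes), and the fallback path simply replaces characters the stream cannot encode. It has not been hardened for very long strings, which WriteConsoleW may need to receive in chunks, and whether the glyphs display correctly still depends on the console font.

```python
import sys

STD_OUTPUT_HANDLE = -11  # Win32 constant identifying the standard output handle


def write_console(text):
    """Write a Unicode string; use WriteConsoleW only on a real Windows console.

    Returns True if WriteConsoleW was used, False if the text went through
    sys.stdout instead (non-Windows platform, or output redirected).
    """
    if sys.platform == 'win32':
        import ctypes
        from ctypes import wintypes

        kernel32 = ctypes.WinDLL('kernel32')
        kernel32.GetStdHandle.argtypes = [wintypes.DWORD]
        kernel32.GetStdHandle.restype = wintypes.HANDLE
        kernel32.GetConsoleMode.argtypes = [
            wintypes.HANDLE, ctypes.POINTER(wintypes.DWORD)]
        kernel32.GetConsoleMode.restype = wintypes.BOOL
        kernel32.WriteConsoleW.argtypes = [
            wintypes.HANDLE, ctypes.c_wchar_p, wintypes.DWORD,
            ctypes.POINTER(wintypes.DWORD), ctypes.c_void_p]
        kernel32.WriteConsoleW.restype = wintypes.BOOL

        handle = kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
        mode = wintypes.DWORD()
        # GetConsoleMode succeeds only for console handles, so a non-zero
        # result means stdout is attached to an actual console window.
        if kernel32.GetConsoleMode(handle, ctypes.byref(mode)):
            sys.stdout.flush()  # keep ordering with earlier buffered output
            # WriteConsoleW counts UTF-16 code units, not code points.
            n_units = len(text.encode('utf-16-le')) // 2
            written = wintypes.DWORD()
            kernel32.WriteConsoleW(
                handle, text, n_units, ctypes.byref(written), None)
            return True

    # Any other stream: stay within what its encoding can represent,
    # replacing unsupported characters with '?'.
    encoding = getattr(sys.stdout, 'encoding', None) or 'ascii'
    sys.stdout.write(text.encode(encoding, 'replace').decode(encoding))
    return False
```

A call such as write_console(u'Michael Bubl\u00e9\n') would then take the console path only when stdout is an interactive Windows console, and behave like an ordinary write everywhere else.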
