
Python encoding problems in xml

I have a media player and would like to send what I'm playing to trakt.tv. Everything works fine except for foreign letters in the title/path. The system is running Python 2.7.3.

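from urllib2 import urlopen

def getStatus(self, ip, timeout=10.0):
    oPchStatus = PchStatus()
    try:
        # Ask the player what it is currently playing
        oResponse = urlopen("http://" + ip + ":8008/playback?arg0=get_current_vod_info", None, timeout)
        oPchStatus = self.parseResponse(oResponse.readlines()[0])
    except Exception as e:
        # Keep the default status and record what went wrong
        oPchStatus.error = e
    return oPchStatus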

This will return something like this:

<?xml version="1.0"?>
<theDavidBox>
  <request>
    <arg0>get_current_vod_info</arg0>
    <module>playback</module>
  </request>
  <response>
    <currentStatus>pause</currentStatus>
    <currentTime>3190</currentTime>
    <downloadSpeed>0</downloadSpeed>
    <fullPath>/opt/sybhttpd/localhost.drives/HARD_DISK/Storage/NAS/Videos/FILMS/A.Haunted.House.(2013)/A Haunted House.avi</fullPath>
    <lastPacketTime>0</lastPacketTime>
    <mediatype>OTHERS</mediatype>
    <seekEnable>true</seekEnable>
    <title/>
    <totalTime>4860</totalTime>
  </response>
  <returnValue>0</returnValue>
</theDavidBox>

The next step takes the above and assigns each item to a variable.

class PchStatus:
    def __init__(self):
        self.status=EnumStatus.NOPLAY
        self.fullPath = u""
        self.fileName = u""
        self.currentTime = 0
        self.totalTime = 0
        self.percent = 0
        self.mediaType = ""
        self.currentChapter = 0 # For Blu-ray Disc only
        self.totalChapter = 0 # For Blu-ray Disc only
        self.error = None

class PchRequestor:

    def parseResponse(self, response):
        oPchStatus = PchStatus()
        try:
            response = unescape(response)
            oXml = ElementTree.XML(response)
            if oXml.tag == "theDavidBox": # theDavidBox should be the root
                if oXml.find("returnValue").text == '0' and int(oXml.find("response/totalTime").text) > 90:#Added total time check to avoid scrobble while playing adverts/trailers
                    oPchStatus.totalTime = int(oXml.find("response/totalTime").text)
                    oPchStatus.status = oXml.find("response/currentStatus").text
                    oPchStatus.fullPath = oXml.find("response/fullPath").text
                    oPchStatus.currentTime = int(oXml.find("response/currentTime").text)

and so on. Using the XML returned above:

oPchStatus.totalTime would be 4860
oPchStatus.status would be "pause"
oPchStatus.fullPath would be "/opt/sybhttpd/localhost.drives/HARD_DISK/Storage/NAS/Videos/FILMS/A.Haunted.House.(2013)/A Haunted House.avi"
oPchStatus.currentTime would be 3190

As I said, this works well until there is a foreign letter in the title. A title like Le.Fabuleux.Destin.d'Amélie.Poulain.(2001).avi will make oPchStatus.fullPath contain the string "/opt/sybhttpd/localhost.drives/HARD_DISK/Storage/NAS/Videos/Le.Fabuleux.Destin.d'Am\xe9lie.Poulain.(2001).avi"

and not

"/opt/sybhttpd/localhost.drives/HARD_DISK/Storage/NAS/Videos/Le.Fabuleux.Destin.d'Amélie.Poulain.(2001).avi"

as I want it to be.

Further on in the script there are routines that scan XML files for the file name and that create FILENAME.watched, so I need the file names to match the actual file name, with no letters replaced.

What would be the best way to ensure these types of file names are encoded properly? I have tried to provide as much info as possible, but if you need more, please just ask.

Python is merely keeping your string value printable in ASCII by showing you the escape code for the é character, \xe9.

Some notes on your linked source code:

  • You should not turn the response you want to parse into unicode. Parse the raw bytes instead. Parsers expect to decode the contents themselves. In fact, the ElementTree parser will encode the data again just to be able to parse it.

  • When you have XML in a bytestring, I'd use the ElementTree.fromstring() function instead; yes, underneath it uses ElementTree.XML() like you do, but fromstring() is the documented API.
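
The two notes above can be sketched roughly as follows. This assumes the player sends UTF-8 encoded XML (the sample response declares no encoding, so that is an assumption); the parse_full_path helper and the shortened sample document are hypothetical and only for illustration:

```python
from xml.etree import ElementTree

def parse_full_path(raw_bytes):
    """Return response/fullPath as text, or None if it is not available."""
    # Hand the undecoded bytes to the parser; it does the decoding itself.
    root = ElementTree.fromstring(raw_bytes)
    if root.tag != "theDavidBox":
        return None
    return root.findtext("response/fullPath")

# Hypothetical, shortened response; \xc3\xa9 is the UTF-8 encoding of U+00E9.
SAMPLE = (b"<?xml version=\"1.0\"?>"
          b"<theDavidBox><response>"
          b"<fullPath>/Videos/Le.Fabuleux.Destin.d'Am\xc3\xa9lie.Poulain.(2001).avi</fullPath>"
          b"</response></theDavidBox>")

full_path = parse_full_path(SAMPLE)
```

Here full_path holds the real character U+00E9, so it can be compared directly with a file name read from disk as unicode.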

Otherwise, your example input is working exactly as it should. If I create an XML document from your example with non-ASCII characters in the file path, I get the following:

>>> tree = ElementTree.fromstring(response)
>>> print tree.find("response/fullPath").text
/opt/sybhttpd/localhost.drives/HARD_DISK/Storage/NAS/Videos/Le.Fabuleux.Destin.d'Amélie.Poulain.(2001).avi
>>> tree.find("response/fullPath").text
u"/opt/sybhttpd/localhost.drives/HARD_DISK/Storage/NAS/Videos/Le.Fabuleux.Destin.d'Am\xe9lie.Poulain.(2001).avi"

As you can see, the unicode value returned by .text contains an é character (Unicode codepoint U+00E9, LATIN SMALL LETTER E WITH ACUTE). When shown as a Python literal, Python makes sure it is printable in an ASCII context by giving the Python escape code for that codepoint, \xe9. This is normal; nothing is broken.
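
To make the difference between the value and its escaped display concrete, here is a small sketch. The watched_marker helper is hypothetical, and replacing the extension with .watched is only an assumption about how the script names its marker files (it may append the suffix instead):

```python
import os.path

# The kind of value .text returns: one real character U+00E9,
# not a backslash followed by "xe9".
name = u"Am\xe9lie.avi"

def watched_marker(path):
    """Hypothetical helper: swap the file extension for '.watched'."""
    base, _ext = os.path.splitext(path)
    return base + u".watched"

# The escaped form is only a display representation of the same value.
escaped = name.encode("unicode_escape")  # ASCII-only bytes with a literal backslash
as_utf8 = name.encode("utf-8")           # bytes to use where a byte string is required
marker = watched_marker(name)
```

The string itself never contains a backslash; only the escaped representation does, so file names built from it keep the original letter.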
