Python kludge to read UCS-2 (UTF-16?) as ASCII

Question

I'm in a little over my head on this one, so please pardon my terminology in advance.

I'm running this using Python 2.7 on Windows XP.

I found some Python code that reads a log file, does some stuff, then displays something.

What, that's not enough detail? Ok, here's a simplified version:

#!/usr/bin/python

import re
import sys

class NotSupportedTOCError(Exception):
    pass

def filter_toc_entries(lines):
    while True:
        line = lines.next()
        if re.match(r""" \s* 
                   .+\s+ \| (?#track)
                \s+.+\s+ \| (?#start)
                \s+.+\s+ \| (?#length)
                \s+.+\s+ \| (?#start sec)
                \s+.+\s*$   (?#end sec)
                """, line, re.X):
            lines.next()
            break

    while True:
        line = lines.next()
        m = re.match(r"""
            ^\s*
            (?P<num>\d+)
            \s*\|\s*
            (?P<start_time>[0-9:.]+)
            \s*\|\s*
            (?P<length_time>[0-9:.]+)
            \s*\|\s*
            (?P<start_sector>\d+)
            \s*\|\s*
            (?P<end_sector>\d+)
            \s*$
            """, line, re.X)
        if not m:
            break
        yield m.groupdict()

def calculate_mb_toc_numbers(eac_entries):
    eac = list(eac_entries)
    num_tracks = len(eac)

    tracknums = [int(e['num']) for e in eac]
    if range(1,num_tracks+1) != tracknums:
        raise NotSupportedTOCError("Non-standard track number sequence: %s" % tracknums)

    leadout_offset = int(eac[-1]['end_sector']) + 150 + 1
    offsets = [(int(x['start_sector']) + 150) for x in eac]
    return [1, num_tracks, leadout_offset] + offsets

f = open(sys.argv[1])

mb_toc_urlpart = "%20".join(str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))

print mb_toc_urlpart

The code works fine as long as the log file is "simple" text (I'm tempted to say ASCII, although that may not be precise; e.g. Notepad++ reports it as ANSI).

However, the script doesn't work on certain log files (in these cases, Notepad++ says "UCS-2 Little Endian").

I get the following error:

Traceback (most recent call last):
  File "simple.py", line 55, in <module>
    mb_toc_urlpart = "%20".join(str(x) for x in calculate_mb_toc_numbers(filter_
toc_entries(f)))
  File "simple.py", line 49, in calculate_mb_toc_numbers
    leadout_offset = int(eac[-1]['end_sector']) + 150 + 1
IndexError: list index out of range

This log works

This log breaks

I believe it's the encoding that's breaking the script because if I simply do this at a command prompt:

type ascii.log > scrubbed.log

and then run the script on scrubbed.log, the script works fine (this is actually fine for my purposes since there's no loss of important information and I'm not writing back to a file, just printing to the console).

One workaround would be to "scrub" the log file before passing it to Python (e.g. using the type redirection trick above to write a temporary file and then running the script on that), but I would like to have Python "ignore" the encoding if that's possible. I'm also not sure how to detect what type of log file the script is reading so I can act appropriately.

I'm reading this and this but my eyes are still spinning around in their head, so while that may be my longer term strategy, I'm wondering if there's an interim hack I could use.

Answer 1

codecs.open() will allow you to open a file using a specific encoding, and it will produce unicode objects. You can try a few encodings, going from most likely to least likely (or the tool could just always produce UTF-16LE but ha ha fat chance).

Also, see "Unicode In Python, Completely Demystified".
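
The "try a few encodings" idea could be sketched roughly as follows. The helper name read_log_text and the candidate list are assumptions for illustration, not part of the original script. Strict ASCII is tried first because it fails immediately on the UTF-16 byte order mark, so the fallback order is safe for the two kinds of log described in the question.

```python
import codecs

def read_log_text(path, encodings=("ascii", "utf-16")):
    """Return the file contents as unicode text.

    Hypothetical helper: tries each candidate encoding in order and
    returns the result of the first one that decodes without error.
    """
    for encoding in encodings:
        try:
            with codecs.open(path, "r", encoding=encoding) as f:
                return f.read()
        except UnicodeError:
            # Wrong guess (e.g. a BOM byte under strict ASCII); try the next one.
            continue
    raise ValueError("could not decode %r with any of %r" % (path, encodings))
```

Because codecs.open() always opens the underlying file in binary mode, line endings such as '\r\n' come through unchanged; the regexes in the question already tolerate trailing whitespace.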

Answer 2

works.log appears to be encoded in ASCII:

>>> data = open('works.log', 'rb').read()
>>> all(d < '\x80' for d in data)
True

breaks.log appears to be encoded in UTF-16LE: it starts with the 2 bytes '\xff\xfe'. None of the characters in breaks.log are outside the ASCII range:

>>> data = open('breaks.log', 'rb').read()
>>> data[:2]
'\xff\xfe'
>>> udata = data.decode('utf16')
>>> all(d < u'\x80' for d in udata)
True
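
This also explains the IndexError. In UTF-16LE every ASCII character is followed by a NUL byte, so when the file is read as plain bytes the regexes never match, filter_toc_entries() yields nothing, and eac[-1] fails on an empty list. A small sketch of the effect (the sample row and the shortened pattern are made up for illustration, not taken from the logs):

```python
import re

# Made-up sample row in the same "num | start | length | start | end" shape.
row = u"  1  |  0:00.00 |  4:02.15 |     0    |   18164\r\n"
pattern = re.compile(r"^\s*(\d+)\s*\|")

as_ascii = row.encode("ascii")        # what a plain-text log contains
as_utf16 = row.encode("utf-16-le")    # what a "UCS-2 Little Endian" log contains

# NUL bytes are interleaved with the ASCII characters in the UTF-16LE form.
nul_count = as_utf16.count(b"\x00")

# The pattern matches the properly decoded text, but not the raw UTF-16LE
# bytes reinterpreted as one-byte characters.
matches_text = pattern.match(as_ascii.decode("ascii")) is not None
matches_raw = pattern.match(as_utf16.decode("latin-1")) is not None
```

Here matches_text is True and matches_raw is False, which is why decoding the data before it reaches the regexes is enough to fix the script.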

If these are the only two possibilities, you should be able to get away with the following hack. Change your mainline code from:

f = open(sys.argv[1])
mb_toc_urlpart = "%20".join(
    str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))
print mb_toc_urlpart

to this:

f = open(sys.argv[1], 'rb')
data = f.read()
f.close()
if data[:2] == '\xff\xfe':
    data = data.decode('utf16').encode('ascii')
# ilines is a generator which produces newline-terminated strings
ilines = (line + '\n' for line in data.splitlines())
mb_toc_urlpart = "%20".join(
    str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(ilines)))
print mb_toc_urlpart
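
The same idea can be wrapped in a function so it is easier to reuse and test. This is only a sketch under a few assumptions: the name log_lines_from_bytes is hypothetical, it also accepts a big-endian BOM (which this answer does not discuss), it yields unicode lines instead of re-encoding to ASCII, and it replaces undecodable bytes in an "ANSI" log rather than raising.

```python
import codecs

def log_lines_from_bytes(data):
    """Yield newline-terminated text lines from raw log bytes.

    Hypothetical helper: data with a UTF-16 byte order mark is decoded
    as UTF-16; anything else is assumed to be single-byte ASCII/ANSI text.
    """
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The 'utf-16' codec reads the BOM to pick the byte order and drops it.
        text = data.decode("utf-16")
    else:
        # Assumption: non-ASCII bytes in an "ANSI" log are not needed,
        # so they are replaced with U+FFFD instead of raising.
        text = data.decode("ascii", "replace")
    for line in text.splitlines():
        yield line + u"\n"
```

In the mainline this would stand in for the ilines generator: read the file with open(sys.argv[1], 'rb') and pass log_lines_from_bytes(data) to filter_toc_entries(). The regexes and int() calls in the question accept unicode strings in Python 2.7, so no re-encoding step should be needed.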

Answer 3

In Python 2.x, a plain str is a sequence of bytes and is treated as ASCII by default (one byte per character), so UTF-16 text has to be decoded to unicode first. Try this:

Put this at the top of your Python source file:

from __future__ import unicode_literals

And change all the str calls to unicode.

[edit]

And as Ignacio Vazquez-Abrams wrote, try codecs.open() to open the input file.

solution1
6 2011-03-09 05:55:41

solution2
3 ACCEPTED 2011-03-09 09:22:39

solution3
0 2011-03-09 05:57:11