Python kludge to read UCS-2 (UTF-16?) as ASCII

Question

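I'm in a bit over my head with this one, so please forgive my terminology in advance.

I'm running this with Python 2.7 on Windows XP.

I found some Python code that reads a log file, does some processing, and then displays a result.

What, that's not enough detail? OK, here's a simplified version: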

#!/usr/bin/python

import re
import sys

class NotSupportedTOCError(Exception):
    pass

def filter_toc_entries(lines):
    while True:
        line = lines.next()
        if re.match(r""" \s* 
                   .+\s+ \| (?#track)
                \s+.+\s+ \| (?#start)
                \s+.+\s+ \| (?#length)
                \s+.+\s+ \| (?#start sec)
                \s+.+\s*$   (?#end sec)
                """, line, re.X):
            lines.next()
            break

    while True:
        line = lines.next()
        m = re.match(r"""
            ^\s*
            (?P<num>\d+)
            \s*\|\s*
            (?P<start_time>[0-9:.]+)
            \s*\|\s*
            (?P<length_time>[0-9:.]+)
            \s*\|\s*
            (?P<start_sector>\d+)
            \s*\|\s*
            (?P<end_sector>\d+)
            \s*$
            """, line, re.X)
        if not m:
            break
        yield m.groupdict()

def calculate_mb_toc_numbers(eac_entries):
    eac = list(eac_entries)
    num_tracks = len(eac)

    tracknums = [int(e['num']) for e in eac]
    if range(1,num_tracks+1) != tracknums:
        raise NotSupportedTOCError("Non-standard track number sequence: %s" % (tracknums,))

    leadout_offset = int(eac[-1]['end_sector']) + 150 + 1
    offsets = [(int(x['start_sector']) + 150) for x in eac]
    return [1, num_tracks, leadout_offset] + offsets

f = open(sys.argv[1])

mb_toc_urlpart = "%20".join(str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))

print mb_toc_urlpart

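The code works fine as long as the log file is "plain" text (I'm tempted to say ASCII, although that may not be precise or accurate; Notepad++, for example, reports it as ANSI).

However, the script doesn't work on some log files (in those cases, Notepad++ says "UCS-2 Little Endian").

I get the following error: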

Traceback (most recent call last):
  File "simple.py", line 55, in <module>
    mb_toc_urlpart = "%20".join(str(x) for x in calculate_mb_toc_numbers(filter_
toc_entries(f)))
  File "simple.py", line 49, in calculate_mb_toc_numbers
    leadout_offset = int(eac[-1]['end_sector']) + 150 + 1
IndexError: list index out of range

This log works

This log breaks

I believe it's the encoding that breaks the script, because if I simply do this at the command prompt:

type ascii.log > scrubbed.log

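and then run the script on scrubbed.log, the script works fine (which is fine for my purposes, since no important information is lost and I'm not writing back to the file, just printing to the console).

One workaround would be to "scrub" the log file before passing it to Python (e.g., redirect the output of type into a temporary file as above, then run the script on that), but I'd like Python to "ignore" the encoding if that's possible. I'm also not sure how to detect which type of log file the script is reading so that I can act appropriately.

I'm reading this and this, but my eyes are still spinning around in my head, so while that may be my long-term strategy, I'm wondering if there's an interim hack I could use.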

Answer 1

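codecs.open() lets you open a file with a specific encoding, and it will produce unicode objects. You can try a few encodings, going from most likely to least likely (or the tool could just always produce UTF-16LE, but fat chance).

Also, see "Unicode In Python, Completely Demystified".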
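
The "try a few encodings" idea could look roughly like the sketch below. It is only an illustration, assuming the logs are either plain ASCII or UTF-16 with a byte order mark. The helper name read_text and its default encoding list are made up for this example, and io.open is used in place of codecs.open because it takes the same encoding argument and behaves the same on Python 2.7 and Python 3.

```python
import io

def read_text(path, encodings=("ascii", "utf-16")):
    """Return the file's text, decoded with the first encoding that succeeds."""
    for encoding in encodings:
        try:
            with io.open(path, "r", encoding=encoding) as f:
                return f.read()
        except UnicodeError:
            # Wrong guess: fall through and try the next candidate.
            continue
    raise ValueError("could not decode %s with any of %r" % (path, encodings))
```

The strict codec (ASCII) goes first on purpose: a UTF-16 file with a BOM can never pass as ASCII, whereas the reverse order could decode an ASCII file into garbage without raising an error. Note that io.open in text mode also translates "\r\n" line endings to "\n".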

Answer 2

works.log appears to be encoded in ASCII:

>>> data = open('works.log', 'rb').read()
>>> all(d < '\x80' for d in data)
True

breaks.log appears to be encoded in UTF-16LE; it starts with the 2 bytes '\xff\xfe'. All of the characters in breaks.log are in the ASCII range:

>>> data = open('breaks.log', 'rb').read()
>>> data[:2]
'\xff\xfe'
>>> udata = data.decode('utf16')
>>> all(d < u'\x80' for d in udata)
True
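
The two interactive checks above can be folded into one small helper. This is a sketch under the assumption that only ASCII and BOM-prefixed UTF-16 need to be told apart; the function name sniff_log_encoding is hypothetical, not part of any library.

```python
import codecs

def sniff_log_encoding(data):
    """Guess the encoding of raw log bytes: 'utf-16', 'ascii', or None."""
    # A UTF-16 file normally starts with a byte order mark (BOM).
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # bytearray yields integers on both Python 2 and Python 3.
    if all(byte < 0x80 for byte in bytearray(data)):
        return "ascii"
    # Neither check matched, so leave the decision to the caller.
    return None
```

It would be called on the raw bytes, for example sniff_log_encoding(open('breaks.log', 'rb').read()).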

If those are the only two possibilities, you should be able to get away with the following hack. Change your mainline code from this:

f = open(sys.argv[1])
mb_toc_urlpart = "%20".join(
    str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))
print mb_toc_urlpart

To this:

f = open(sys.argv[1], 'rb')
data = f.read()
f.close()
if data[:2] == '\xff\xfe':
    data = data.decode('utf16').encode('ascii')
# ilines is a generator which produces newline-terminated strings
ilines = (line + '\n' for line in data.splitlines())
mb_toc_urlpart = "%20".join(
    str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(ilines)))
print mb_toc_urlpart
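
The same hack can also be written as a reusable generator. The sketch below is one possible variant, not the answer's exact code: the name iter_log_lines is invented here, it yields decoded text instead of re-encoding to ASCII bytes, and it assumes that any file without a UTF-16 BOM is plain ASCII.

```python
import codecs

def iter_log_lines(data):
    """Yield newline-terminated text lines from raw log bytes."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The 'utf-16' codec reads the BOM to pick the byte order, then drops it.
        text = data.decode("utf-16")
    else:
        # Assumption: a file without a UTF-16 BOM is plain ASCII, like works.log.
        text = data.decode("ascii")
    for line in text.splitlines():
        yield line + "\n"
```

In the mainline code this would replace the ilines expression: read the file with open(sys.argv[1], 'rb'), then pass iter_log_lines(data) to filter_toc_entries. The regular expressions and int() calls in the script accept unicode strings under Python 2.7, so the encode('ascii') step should not be needed for this use.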

Answer 3

Python 2.x expects normal strings to be ASCII (or at least one byte per character). Try this:

Put this at the top of your Python source file:

from __future__ import unicode_literals

and change all str to unicode.

[EDIT]

As Ignacio Vazquez-Abrams wrote, try codecs.open() to open the input file.

Answer 1: 6, 2011-03-09 05:55:41
Answer 2: 3, accepted, 2011-03-09 09:22:39
Answer 3: 0, 2011-03-09 05:57:11