Splitting ascii/unicode string
I am trying to decode the ID3v2 (MP3 header) protocol using Python. The format of the data to be decoded is as follows:

data = s1+delimiter+s2+delimiter+...+sn

s1, s2, ... sn-1 are unicode (utf-16/utf-8) strings, and the last string sn may be a unicode or a binary string. The delimiter for utf-16 is '\x00'+'\x00', and the delimiter for utf-8 is '\x00'.
I get data along with its encoding type. Now I have to extract all the strings (s1, s2, ... sn) from data. For this I am using split() as follows:
#!/usr/bin/python
def extractStrings(encoding_type, data):
    if encoding_type == "utf-8":
        delimiters = '\x00'
    else:
        delimiters = '\x00' + '\x00'
    return data.split(delimiters)

def main():
    # Set-1
    encoding_type = "utf-8"
    delimiters = '\x00'
    s1 = "Hello".encode(encoding_type)
    s2 = "world".encode(encoding_type)
    data = s1 + delimiters + s2
    print extractStrings(encoding_type, data)

    # Set-2
    encoding_type = "utf-16"
    delimiters = '\x00' + '\x00'
    s1 = "Hello".encode(encoding_type)
    s2 = "world".encode(encoding_type)
    data = s1 + delimiters + s2
    print extractStrings(encoding_type, data)

if __name__ == "__main__":
    main()
Output:

['Hello', 'world']
['\xff\xfeH\x00e\x00l\x00l\x00o', '\x00\xff\xfew\x00o\x00r\x00l\x00d\x00']

It works for the set-1 data but doesn't work for set-2, because data in set-2

'\xff\xfeH\x00e\x00l\x00l\x00o\x00\x00\x00\xff\xfew\x00o\x00r\x00l\x00d\x00'
                              ^   ^

has an extra '\x00' preceding the delimiter: in UTF-16-LE the letter 'o' encodes as 'o\x00', so split() cannot find the right boundary.

Can anyone help me to decode data properly in both cases?
Update:

I will try to simplify the issue.

s1 = encoded (utf-8/utf-16) string
s2 = binary string (not unicode)

The delimiter for utf-16 is '\x00'+'\x00', and the delimiter for utf-8 is '\x00'.

data = (s1+delimiter)+s2

Can anyone help me to extract s1 and s2 from data?
Update2: Solution

The following code works for my requirement:

def splitNullTerminatedEncStrings(self, data, encoding_type, no_of_splits):
    data_dec = data.decode(encoding_type, 'ignore')
    chunks = data_dec.split('\x00', no_of_splits)
    enc_str_lst = []
    for data_dec_seg in chunks[:-1]:
        enc_str_lst.append(data_dec_seg.encode(encoding_type))
    data_dec_chunks = '\x00'.join(chunks[:-1])
    if(data_dec_chunks): data_dec_chunks += '\x00'
    data_chunks = data_dec_chunks.encode(encoding_type)
    data_chunks_len = len(data_chunks)
    enc_str_lst.append(data[data_chunks_len:]) # last segment
    return enc_str_lst
Where, delimiter for utf-16 is '\x00'+'\x00' and delimiter for utf-8 is '\x00'

Not exactly. The delimiter for UTF-16 is \0\0 only at a code unit boundary. One \0 at the end of one code unit followed by \0 at the start of another code unit does not constitute a delimiter. The ID3 standard, in talking about byte 'synchronisation', implies that this isn't the case, but it's wrong.
[Aside: unfortunately, many tag-reading tools do take it literally that way, with the result that any sequence containing a double zero byte (eg U+0100,U+0061 Āa in UTF-16BE, or, as you discovered, any ASCII at the end of a string in UTF-16LE) will break the frame. As a result, the UTF-16 text formats (UTF-16+BOM 0x01 and UTF-16BE 0x02) are completely unreliable and should be avoided by all tag writers. And text format 0x00 is unreliable for anything but pure ASCII. UTF-8 is the winner!]
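To make the aside concrete, here is a minimal Python 3 sketch (standard library only) showing how a double zero byte can legitimately occur inside UTF-16 text:

```python
# A double zero byte can straddle a code-unit boundary in valid UTF-16 text.
s = '\u0100a'                # U+0100 (bytes 01 00 in BE) followed by 'a' (00 61)
be = s.encode('utf-16-be')
print(be)                    # b'\x01\x00\x00a' - contains b'\x00\x00'
assert b'\x00\x00' in be

# In UTF-16LE, any trailing ASCII letter leaves a zero byte right before a
# b'\x00\x00' terminator, producing three zero bytes in a row.
le = 'Hello'.encode('utf-16-le') + b'\x00\x00'
assert le.endswith(b'o\x00\x00\x00')
```

A naive byte-level search for b'\x00\x00' would therefore split both of these buffers in the wrong place.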
If you have a list-of-encoded-terminated-strings structure like those specified for the T frames (other than TXXX), then the simple approach is to just decode them before splitting on the U+0000 terminator:
def extractStrings(encoding_type, data):
    chars = data.decode(encoding_type)
    # chars is now a Unicode string; the delimiter is always character U+0000
    return chars.split(u'\0')
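A quick usage sketch (the function above restated in Python 3 so it is self-contained; note the buffer is encoded as one stream, so there is only a single BOM):

```python
def extractStrings(encoding_type, data):
    chars = data.decode(encoding_type)
    # chars is now a Unicode string; the delimiter is always U+0000
    return chars.split('\0')

# a single encoded stream containing two terminated strings:
print(extractStrings('utf-16', 'Hello\x00world'.encode('utf-16')))  # ['Hello', 'world']
print(extractStrings('utf-8', 'Hello\x00world'.encode('utf-8')))    # ['Hello', 'world']
```

Decoding first sidesteps the set-2 problem entirely, because the trailing zero byte of 'o' is consumed as part of the character rather than mistaken for half a delimiter.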
If data is a whole ID3 frame, I'm afraid you can't process it with a single split(). Frames other than the T family contain a mixture of encoded-terminated-strings, ASCII-only-terminated-strings, binary objects (which have no termination) and integer byte/word values. APIC is one such frame, but for the general case you'd have to know the structure of every frame you want to parse in advance, and consume each field one by one, finding each terminator manually as you go.
To find the code-unit-aligned terminator in UTF-16-encoded data without misinterpreting Āa et al, you could use a regex, eg:
ix= re.match('((?!\0\0)..)*', data, re.DOTALL).end()
s, remainder= data[:ix], data[ix+2:]
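Applied to the question's problem case, the pairwise regex only stops at an aligned \0\0 (a Python 3 sketch, so the pattern is spelled as bytes; the binary payload is made up for illustration):

```python
import re

# UTF-16LE 'Hello' ends in b'o\x00'; a naive byte search would split one byte
# too early, but this regex consumes code units pairwise and only stops when
# a pair itself is b'\x00\x00'.
data = 'Hello'.encode('utf-16-le') + b'\x00\x00' + b'<binary payload>'
ix = re.match(b'((?!\x00\x00)..)*', data, re.DOTALL).end()
s, remainder = data[:ix], data[ix + 2:]
assert s.decode('utf-16-le') == 'Hello'
assert remainder == b'<binary payload>'
```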
This isn't a lot of fun really - ID3v2 is not a very clean format. Off the top of my head and untested, this sort of thing is how I might approach it:
p= FrameParser(data)
if frametype=='APIC':
    encoding= p.encoding()
    mimetype= p.string()
    pictype= p.number(1)
    desc= p.encodedstring()
    img= p.binary()
import re

class FrameParser(object):
    def __init__(self, data):
        self._data= data
        self._ix= 0
        self._encoding= 0

    def encoding(self): # encoding byte - remember for later call to encodedstring()
        self._encoding= self.number(1)
        if not 0<=self._encoding<4:
            raise ValueError('Unknown ID3 text encoding %r' % self._encoding)
        return self._encoding

    def number(self, nbytes= 1): # big-endian integer of nbytes bytes
        n= 0
        for i in range(nbytes):
            n*= 256
            n+= ord(self._data[self._ix])
            self._ix+= 1
        return n

    def binary(self): # the whole of the rest of the data, uninterpreted
        s= self._data[self._ix:]
        self._ix= len(self._data)
        return s

    def string(self): # non-encoded, maybe-terminated string
        return self._string(0)

    def encodedstring(self): # encoded, maybe-terminated string
        return self._string(self._encoding)

    def _string(self, encoding):
        if encoding in (1, 2): # UTF-16 - look for double zero byte on a code unit boundary
            ix= re.match('((?!\0\0)..)*', self._data[self._ix:], re.DOTALL).end()
            s= self._data[self._ix:self._ix+ix]
            self._ix+= ix+2
        else: # single-byte encoding - look for the first zero byte
            ix= self._data.find('\0', self._ix)
            s= self._data[self._ix:ix] if ix!=-1 else self._data[self._ix:]
            self._ix= ix+1 if ix!=-1 else len(self._data)
        return s.decode(['windows-1252', 'utf-16', 'utf-16be', 'utf-8'][encoding])
Why don't you decode the strings first?

Python 2:
decoded = unicode(data, 'utf-8')
# or
decoded = unicode(data, 'utf-16')
Python 3:
decoded = str(data, 'utf-8')
# or
decoded = str(data, 'utf-16')
Then you work directly with encoding-agnostic data, and the delimiter is always a single null character.
The following code works for my requirement:
def splitNullTerminatedEncStrings(self, data, encoding_type, no_of_splits):
    data_dec = data.decode(encoding_type, 'ignore')
    chunks = data_dec.split('\x00', no_of_splits)
    enc_str_lst = []
    for data_dec_seg in chunks[:-1]:
        enc_str_lst.append(data_dec_seg.encode(encoding_type))
    data_dec_chunks = '\x00'.join(chunks[:-1])
    if(data_dec_chunks): data_dec_chunks += '\x00'
    data_chunks = data_dec_chunks.encode(encoding_type)
    data_chunks_len = len(data_chunks)
    enc_str_lst.append(data[data_chunks_len:]) # last segment
    return enc_str_lst
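For reference, a quick Python 3 check of this approach, with the method restated as a standalone function (the self parameter dropped, and the sample binary tail made up for illustration):

```python
def splitNullTerminatedEncStrings(data, encoding_type, no_of_splits):
    # Decode leniently, split on U+0000, then re-encode the leading strings to
    # measure how many bytes they consumed; the tail is sliced from the
    # original bytes, so the binary segment is returned untouched.
    data_dec = data.decode(encoding_type, 'ignore')
    chunks = data_dec.split('\x00', no_of_splits)
    enc_str_lst = [seg.encode(encoding_type) for seg in chunks[:-1]]
    data_dec_chunks = '\x00'.join(chunks[:-1])
    if data_dec_chunks:
        data_dec_chunks += '\x00'
    data_chunks = data_dec_chunks.encode(encoding_type)
    enc_str_lst.append(data[len(data_chunks):])  # last segment, possibly binary
    return enc_str_lst

# utf-8 string terminated by '\x00', followed by a raw binary tail:
print(splitNullTerminatedEncStrings(b'Hello\x00\x89PNG', 'utf-8', 1))
# [b'Hello', b'\x89PNG']
```

The 'ignore' error handler matters only while measuring the prefix length; invalid bytes in the tail are dropped from the decoded working copy, never from the returned data.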