
How can I extract mixed binary and ASCII values from a bytes string as in Python 2.7?

The following represents a binary image extracted from a file (spaces inserted between bytes to make reading easier). The file is opened in 'rb' mode.

01 77 33 9F 41 42 43 44 00 11 11 11 01 77 33 9F 41 42 43 44 00 11 11 11

In Python 2.7, I read it as a character string and use ord() to extract the binary values. I can then extract, or even search the string for, a specific text value (such as the "ABCD" in characters 4-7). The binary bytes can be anything from 0x00 to 0xFF. I've been putting off conversion to Python 3 partly because of this.

I need to be able, in Python 3, to treat a string of bytes as a mixture of binary and ASCII (not Unicode) values. The format is not fixed; it consists of data structures. For example, the 33 in byte 2 might be a record length that tells me where the start of the next record is. In other words, I can't just say that I know the text string is always in location 4.

I don't write the file; I just use it, so changing it is not an option.

I've seen lots of examples of using b' and other things to convert fixed strings, but I need a way to intermix these values: extracting single bytes, extracting 2-byte to 8-byte values as 16-bit to 64-bit words, and extracting or searching for ASCII strings within the larger string.

The byte/character separation in Python 3 seems somewhat inflexible for what I need. I'm sure there's a way to do this, but I haven't found an example or an answered question that seems to cover this case.

This is a simplified example. I can't provide real data (it's proprietary), but this illustrates the problem. The real files may be short (<1K) or huge (>100K), containing multiple records of different sizes.

Is there an easy, straightforward way to essentially replicate the functionality I have in Python 2.7?

This is on Windows.

Thanks

I need to be able, in Python 3, to treat a string of bytes as a mixture of binary and ASCII (not Unicode) values. The format is not fixed; it consists of data structures. For example, the 33 in byte 2 might be a record length that tells me where the start of the next record is. In other words, I can't just say that I know the text string is always in location 4.

  1. Read the file in binary mode, as you are doing. This produces a bytes object, which in 3.x is not the same as a str (as it would be in 2.x).

  2. Interpret the bytes as bytes, as needed, to figure out the general structure of the data. Slicing the bytes produces another bytes, as before; indexing produces an int with the numeric value of that single byte (not as before, when it produced a one-character str), so no ord is required.

  3. When you have determined a subset of the bytes that represents a string (let's say, for convenience, that you have sliced it out), convert it to a string using the appropriate encoding, e.g. str(my_bytes, 'ascii'). Note that ASCII will not handle byte values 0x80 through 0xFF; especially with binary-ish legacy file formats, there's a good chance your data is actually something like Latin-1: str(my_bytes, 'iso-8859-1').
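The three steps above can be sketched against the sample bytes from the question. This is only an illustration: `data` is built with `bytes.fromhex` as a stand-in for what `open(path, 'rb').read()` would return, and the variable names are made up for the example.

```python
# Stand-in for: with open(path, 'rb') as f: data = f.read()
data = bytes.fromhex(
    "01 77 33 9F 41 42 43 44 00 11 11 11"
    " 01 77 33 9F 41 42 43 44 00 11 11 11"
)

# Indexing gives an int directly, so ord() is not needed.
first_byte = data[0]
length_byte = data[2]

# Slicing gives another bytes object.
text_bytes = data[4:8]

# Decode only the slice that is known to hold text.
text = str(text_bytes, 'ascii')

# ASCII rejects bytes in the range 0x80-0xFF; Latin-1 accepts every byte value.
try:
    str(data[3:4], 'ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
latin1_char = str(data[3:4], 'iso-8859-1')

print(first_byte, hex(length_byte), text_bytes, text, ascii_ok)
```

`text_bytes.decode('ascii')` is an equivalent spelling of `str(text_bytes, 'ascii')`.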

search the string for a specific text value

You can search at either the text or the byte level: bytes objects support the in operator, searching for either a subsequence of bytes or a single integer value. Whether it makes more sense to search before or after string conversion will depend on what you are doing.
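As a rough illustration of both search levels, again using the sample bytes from the question (the variable names are invented for the example):

```python
data = bytes.fromhex(
    "01 77 33 9F 41 42 43 44 00 11 11 11"
    " 01 77 33 9F 41 42 43 44 00 11 11 11"
)

# Byte-level search: the needle must itself be bytes (or a single int).
has_abcd = b'ABCD' in data
has_9f = 0x9F in data

# find() returns the offset of the first match, or -1 if there is none.
first = data.find(b'ABCD')

# Collect every occurrence by restarting the search after each hit.
offsets = []
pos = data.find(b'ABCD')
while pos != -1:
    offsets.append(pos)
    pos = data.find(b'ABCD', pos + 1)

# Text-level search: decode first, then search with a str needle.
# Latin-1 maps each byte to one character, so offsets stay the same.
as_text = data.decode('iso-8859-1')
text_pos = as_text.find('ABCD')

print(has_abcd, has_9f, first, offsets, text_pos)
```

Searching at the byte level avoids decoding the whole file, which may matter when most of the content is not text at all.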

using b' and other things to convert fixed strings

b'' is just the syntax for a literal bytes object. It's what you'll see if you ask for the repr of what you read from the file. Prefixing a b onto an existing string literal in your code isn't really "converting" anything; it replaces the literal with the value you should have had in the first place.
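A small sketch of that point, using the first eight sample bytes from the question:

```python
data = bytes.fromhex("01 77 33 9F 41 42 43 44")

# repr() shows printable ASCII bytes as characters and the rest as \x escapes.
shown = repr(data)
print(shown)

# A bytes literal written in source code is the same kind of object.
same = (data == b'\x01w3\x9fABCD')

# To go from an existing str to bytes at run time, encode it explicitly.
needle = 'ABCD'.encode('ascii')
found = needle in data

print(same, found)
```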

2-byte to 8-byte values as 16-bit to 64-bit words

The documentation says it at least as well as I could:

>>> help(int.from_bytes)
Help on built-in function from_bytes:

from_bytes(...) method of builtins.type instance
    int.from_bytes(bytes, byteorder, *, signed=False) -> int

    Return the integer represented by the given array of bytes.

    The bytes argument must be a bytes-like object (e.g. bytes or bytearray).

    The byteorder argument determines the byte order used to represent the
    integer.  If byteorder is 'big', the most significant byte is at the
    beginning of the byte array.  If byteorder is 'little', the most
    significant byte is at the end of the byte array.  To request the native
    byte order of the host system, use `sys.byteorder' as the byte order value.

    The signed keyword-only argument indicates whether two's complement is
    used to represent the integer.
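Applied to the sample bytes from the question, `int.from_bytes` might be used as sketched below. The byte order of the real file format is not stated in the question, so both orders are shown. The `struct` calls at the end are my own addition rather than part of the answer above; `struct` is a standard-library module that can unpack several fields in one call, and the field layout used here is purely hypothetical.

```python
import struct

data = bytes.fromhex("01 77 33 9F 41 42 43 44 00 11 11 11")

# 16-bit word from bytes 0-1, in each byte order.
word_be = int.from_bytes(data[0:2], 'big')
word_le = int.from_bytes(data[0:2], 'little')

# 32-bit and 64-bit values work the same way with longer slices.
dword_be = int.from_bytes(data[0:4], 'big')
qword_be = int.from_bytes(data[0:8], 'big')

# signed=True interprets the bytes as two's complement.
signed_byte = int.from_bytes(data[3:4], 'big', signed=True)

# struct (not mentioned in the answer) unpacks several fields at once.
# Hypothetical layout: big-endian 16-bit word, two single bytes, 4 bytes of text.
word, b2, b3, tag = struct.unpack_from('>HBB4s', data, 0)

print(hex(word_be), hex(word_le), hex(dword_be), hex(qword_be), signed_byte)
print(hex(word), hex(b2), hex(b3), tag)
```

`int.from_bytes` is convenient for one-off fields of any width; `struct` tends to read better when a record has a fixed layout with many fields.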
