简体   繁体   中英

Python Regular Expression Extract Chunk of Data From Binary File

I've a binary file. From that file I need to extract few chunk of data using python regular expression.

I need to extract non null characters-set present in-between null characters sets.

For example this is the main character set:

\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\xff\\xfe\\xfe\\x00\\x00\\x23\\x41\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x41\\x49\\x57\\x00\\x00\\x00\\x00\\x32\\x41\\x49\\x57\\x00\\x00\\x00\\x00\\x32\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x56\\x65\\x00\\x35\\x56

The regex should extract below character sets from above master set:

\\xff\\xfe\\xfe\\x00\\x00\\x23\\x41, \\x41\\x49\\x57\\x00\\x00\\x00\\x00\\x32\\x41\\x49\\x57\\x00\\x00\\x00\\x00\\x32 and \\x56\\x65\\x00\\x35\\x56

One thing is important, If it gets more than 5 null bytes continuously then only it should treat these null characters set as separator..otherwise it should include this null bytes into no-null character. As you can see in given example few null characters are also present in extracted character set.

If its not making any sense please let me know I will try to explain it in a better manner.

Thanks in Advance,

You can use split and lstrip with list comprehension as:

s='\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xfe\xfe\x00\x00\x23\x41\x00\x00\x00\x00\x00\x00\x00\x00\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x56\x65\x00\x35\x56'
sp=s.split('\x00\x00\x00\x00\x00')
print [i.lstrip('\x00\\')  for i in sp if i != ""]

Output:

['\xff\xfe\xfe\x00\x00#A', 'AIW\x00\x00\x00\x002AIW\x00\x00\x00\x002', 'Ve\x005V']
  1. split entire data based on 5 nul values.
  2. in the list, find if any element is starting with nul and if it's starting remove them (this works for variable number of nul replacement at start).

You could split on \\x00{5,}
This is 5 or more zero's. Its the delimeter you specified.

In Perl, its something like this

Perl test case

$strLangs =  "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xfe\xfe\x00\x00\x23\x41\x00\x00\x00\x00\x00\x00\x00\x00\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x56\x65\x00\x35\x56";

# Remove leading zero's (5 or more)
$strLangs =~ s/^\x00{5,}//;

# Split on 5 or more 0's
@Alllangs = split /\x00{5,}/, $strLangs;

# Print each language characters
foreach $lang (@Alllangs)
{
    print "<";
    for ( split //, $lang ) {
       printf( "%x,", ord($_)); 
    }
    print ">\n";

}

Output >>

<ff,fe,fe,0,0,23,41,>
<41,49,57,0,0,0,0,32,41,49,57,0,0,0,0,32,>
<56,65,0,35,56,>

Here's how to do it in Python. I had to str.strip() off and leading and trailing nulls to get the regex pattern to prevent the inclusion of an extra empty string at the beginning of the list of results returned from re.split() .

import re

data = ('\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xfe\xfe\x00\x00\x23\x41'
        '\x00\x00\x00\x00\x00\x00\x00\x00\x41\x49\x57\x00\x00\x00\x00\x32\x41'
        '\x49\x57\x00\x00\x00\x00\x32\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
        '\x00\x00\x00\x00\x00\x56\x65\x00\x35\x56'
        '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')

chunks = re.split(r'\000{6,}', data.strip('\x00'))

# display results
print ',\n'.join(''.join('\\x'+ch.encode('hex_codec') for ch in chunk) 
                         for chunk in chunks),

Output:

\xff\xfe\xfe\x00\x00\x23\x41,
\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32,
\x56\x65\x00\x35\x56

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM