Why does this Python regular expression return the wrong string?

Question

Below is a piece of code that should replace one string with another, but it doesn't seem to do so. I am not a Python or regular expression expert; can anyone tell me why this might be going wrong?

def ReplaceCRC( file_path ):
    file = open(file_path,'r+');
    file_str = file.read()
    if( file_str <> '' ):
         crc_list        = re.findall(r'_CalcCRC[(]\s*"\w+"\s*[)]', file_str);
         strs_to_crc     = []
         new_crc_list    = []
         if( crc_list ):
              for crc in crc_list:
                   quote_to_crc    = re.search(r'"\w+"', crc);
                   str_to_crc      = re.search(r'\w+', quote_to_crc.group() ).group();
                   final           = hex(CalcCRC( str_to_crc ))[:2]
                   value           = '%08X' % CalcCRC( str_to_crc )
                   final           = final + value.upper()
                   final_crc       = Insert( crc, ', ' + final + ' ', -1)
                   new_crc_list.append( final_crc )
              if( new_crc_list <> [] ):
                   for i in range(len(crc_list)):
                       print crc_list[i]
                       print new_crc_list[i]
                       term = re.compile( crc_list[i] );
                       print term.sub( new_crc_list[i], file_str );

This is the file it is operating on:

printf( "0x%08X\n", _CalcCRC("THIS_IS_A_CRC") );
printf( "0x%08X\n", _CalcCRC("PATIENT_ZERO") );

This is the output

_CalcCRC("THIS_IS_A_CRC")
_CalcCRC("THIS_IS_A_CRC", 0x97DFEAC9 )
printf( "0x%08X\n", _CalcCRC("THIS_IS_A_CRC") );
printf( "0x%08X\n", _CalcCRC("PATIENT_ZERO") );

_CalcCRC("PATIENT_ZERO")
_CalcCRC("PATIENT_ZERO", 0x0D691C21 )
printf( "0x%08X\n", _CalcCRC("THIS_IS_A_CRC") );
printf( "0x%08X\n", _CalcCRC("PATIENT_ZERO") );

What it should do is find the CRC string, calculate the value, and then put a new string in its place in the original string. I have tried a number of things, but nothing seems to work.

Answer 1

Not your problem, but these 3 lines are amazing:

final           = hex(CalcCRC( str_to_crc ))[:2]
value           = '%08X' % CalcCRC( str_to_crc )
final           = final + value.upper()

Assuming CalcCRC returns a non-negative integer (e.g. 1234567890):

Line 1 sets final to "0x" irrespective of the input!

>>> hex(1234567890)
'0x499602d2'
>>> hex(1234567890)[:2]
'0x'

Line 2 repeats the call to CalcCRC!

>>> value           = '%08X' % 1234567890
>>> value
'499602D2'

Note that value is already uppercase!

and after line 3, final becomes '0x499602D2'

As value is not used again, the whole thing can be replaced by

final = '0x%08X' % CalcCRC(str_to_crc)
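A quick way to check that the one-line form matches the original three lines is sketched below. It is written for Python 3, and the integers are stand-ins for whatever CalcCRC returns (an assumption; the real function is not shown in the question). The second value has a leading zero nibble, which is where the %08X padding matters.

```python
def three_step(crc_value):
    # The original construction from the question.
    final = hex(crc_value)[:2]        # always '0x'
    value = '%08X' % crc_value        # already uppercase, zero-padded to 8 digits
    return final + value.upper()

def one_step(crc_value):
    # The single-expression replacement.
    return '0x%08X' % crc_value

# Stand-in values (hypothetical inputs, not real CalcCRC results).
for crc_value in (1234567890, 0x0D691C21):
    print(three_step(crc_value), one_step(crc_value))
```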

More from Circumlocution City

These lines:

quote_to_crc    = re.search(r'"\w+"', crc);
str_to_crc      = re.search(r'\w+', quote_to_crc.group() ).group();

can be replaced by one of:

str_to_crc = re.search(r'"\w+"', crc).group()[1:-1]
str_to_crc = re.search(r'"(\w+)"', crc).group(1)
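As a small Python 3 sketch of the two shortcuts, using one of the matches from the question as input: both should yield the text between the quotes, the first by slicing the quotes off and the second by letting a capturing group select it.

```python
import re

crc = '_CalcCRC("THIS_IS_A_CRC")'

# Option 1: match the quoted text, then slice off the surrounding quotes.
by_slice = re.search(r'"\w+"', crc).group()[1:-1]

# Option 2: a capturing group picks out only the part between the quotes.
by_group = re.search(r'"(\w+)"', crc).group(1)

print(by_slice, by_group)
```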

Answer 2

A quick peek at the real answer:

You need (inter alia) to use re.escape() ....

term = re.compile(re.escape(crc_list[i]))

and the indentation on your last if looks stuffed.
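To see why re.escape() matters here, a minimal Python 3 sketch using one line of the input file: the text returned by findall contains "(" and ")", which the regex engine reads as a group rather than as literal parentheses, so the unescaped pattern never matches.

```python
import re

file_str = 'printf( "0x%08X\\n", _CalcCRC("THIS_IS_A_CRC") );'
found = '_CalcCRC("THIS_IS_A_CRC")'
replacement = '_CalcCRC("THIS_IS_A_CRC", 0x97DFEAC9 )'

# Unescaped: "(" and ")" delimit a group, so the pattern looks for _CalcCRC
# immediately followed by a quote, which does not occur in the text.
unescaped = re.compile(found).sub(replacement, file_str)

# Escaped: every metacharacter in `found` is matched literally.
escaped = re.compile(re.escape(found)).sub(replacement, file_str)

# sub() returns a new string in both cases; file_str itself is never modified.
print(unescaped == file_str)
print(escaped)
```

Note also that the loop in the question calls sub() on the original file_str each time and only prints the result, so even with escaping each printout would show just one of the replacements.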

... more after dinner :-)

Post-prandial update

You make 3 passes over the whole file, when only one will do the trick. Apart from cutting out an enormous lot of clutter, the main innovation is to use the re.sub functionality that allows the replacement to be a function instead of a string.

import re
import zlib

def CalcCRC(s):
    # This is an example. It doesn't produce the same CRC as your examples do.
    return zlib.crc32(s) & 0xffffffff

def repl_func(mobj):
    str_to_crc = mobj.group(2)
    print "str_to_crc:", repr(str_to_crc)
    crc = CalcCRC(str_to_crc)
    # If my guess about Insert(s1, s2, n) was wrong,
    # adjust the following statement.
    return '%s"%s", 0x%08X%s' % (mobj.group(1), mobj.group(2), crc, mobj.group(3))

def ReplaceCRC(file_handle):
    regex = re.compile(r'(_CalcCRC[(]\s*)"(\w+)"(\s*[)])')
    for line in file_handle:
        print "line:", repr(line)
        line2 = regex.sub(repl_func, line)
        print "line2:", repr(line2)
    return

if __name__ == "__main__":
    import sys, cStringIO
    args = sys.argv[1:]
    if args:
        f = open(args[0], 'r')
    else:
        f = cStringIO.StringIO(r"""
printf( "0x%08X\n", _CalcCRC("THIS_IS_A_CRC") )
other_stuff()
printf( "0x%08X\n", _CalcCRC("PATIENT_ZERO") )
""")
    ReplaceCRC(f)

Result of running script with no args:

line: '\n'
line2: '\n'
line: 'printf( "0x%08X\\n", _CalcCRC("THIS_IS_A_CRC") )\n'
str_to_crc: 'THIS_IS_A_CRC'
line2: 'printf( "0x%08X\\n", _CalcCRC("THIS_IS_A_CRC", 0x98ABAC4B) )\n'
line: 'other_stuff()\n'
line2: 'other_stuff()\n'
line: 'printf( "0x%08X\\n", _CalcCRC("PATIENT_ZERO") )\n'
str_to_crc: 'PATIENT_ZERO'
line2: 'printf( "0x%08X\\n", _CalcCRC("PATIENT_ZERO", 0x76BCDA4E) )\n'
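For readers on Python 3, the same single-pass idea can be sketched as a function that returns the rewritten text instead of printing it. The names calc_crc, add_crc and replace_crc are mine, and calc_crc is again only a zlib placeholder, not the asker's CalcCRC.

```python
import re
import zlib

def calc_crc(s):
    # Placeholder CRC (assumption): zlib's CRC-32, not the asker's CalcCRC.
    return zlib.crc32(s.encode('utf-8')) & 0xffffffff

CRC_CALL = re.compile(r'(_CalcCRC[(]\s*)"(\w+)"(\s*[)])')

def add_crc(mobj):
    # Rebuild the call with the CRC inserted before the closing parenthesis.
    crc = calc_crc(mobj.group(2))
    return '%s"%s", 0x%08X%s' % (mobj.group(1), mobj.group(2), crc, mobj.group(3))

def replace_crc(text):
    # One pass over the text; add_crc is called once per match.
    return CRC_CALL.sub(add_crc, text)

source = 'printf( "0x%08X\\n", _CalcCRC("THIS_IS_A_CRC") );\n'
print(replace_crc(source))
```

Because the replacement is a function, its return value is inserted as-is, so no escaping of the replacement text is needed.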

Answer 3

Is this what you want?

import re

def ripl(mat):
    return '%s, 0x%08X' % (mat.group(1),CalcCRC(mat.group(2)))

regx = re.compile(r'(_CalcCRC[(]\s*"(\w+)"\s*[)])')


def ReplaceCRC( file_path, regx = regx, ripl = ripl ):
    with open(file_path,'r+') as f:
        file_str = f.read()
        print file_str,'\n'
        if file_str:
             file_str = regx.sub(ripl,file_str)
             print file_str
             f.seek(0,0)
             f.write(file_str) 
             f.truncate()

EDIT

I had forgotten the f.truncate() call, which is very important: without it, a tail of the old content remains if the rewritten content is shorter than the original.
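The effect of truncate() can be seen with a throwaway file. This is a Python 3 sketch with made-up contents; rewrite and demo are illustrative helper names, not part of the code above.

```python
import os
import tempfile

def rewrite(path, new_text, truncate):
    # Overwrite the file from the start, optionally cutting off what follows.
    with open(path, 'r+') as f:
        f.read()
        f.seek(0)
        f.write(new_text)
        if truncate:
            f.truncate()

def demo(truncate):
    # Create a 10-character file, rewrite it with 3 characters, read it back.
    fd, path = tempfile.mkstemp(text=True)
    try:
        with os.fdopen(fd, 'w') as f:
            f.write('0123456789')
        rewrite(path, 'abc', truncate)
        with open(path) as f:
            return f.read()
    finally:
        os.remove(path)

print(repr(demo(truncate=False)))  # old tail is still there
print(repr(demo(truncate=True)))   # only the new content remains
```

In the CRC case the rewritten text is longer than the original, so the tail would not show up, but the call keeps the function safe for replacements that shrink the file.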

EDIT 2

John Machin,

There is no mistake; my solution above is correct. It gives:

printf( "0x%08X\n", _CalcCRC("THIS_IS_A_CRC"), 0x97DFEAC9 ); 
printf( "0x%08X\n", _CalcCRC("PATIENT_ZERO"), 0x0D691C21 );

I haven't changed it since your comment. I think I first posted an incorrect solution (I was running various tests to verify some behaviors, and I sometimes mix up my files and code), then you copied that incorrect code to try it, then I noticed the mistake and corrected the code, and then you posted your comment without noticing the correction. I can think of no other cause for the confusion.

By the way, to obtain the same result there is no need for two groups in the pattern defining regx; one is enough. The following regx and ripl() work as well:

regx = re.compile(r'_CalcCRC\(\s*"(\w+)"\s*\)')
# I prefer '\(' to '[(]', and same for '\)' instead of '[)]'

def ripl(mat):
    return '%s, 0x%08X' % (mat.group(),CalcCRC(mat.group(1)))

But an uncertainty remains. Each of our results is reasonable given Joe's imprecise wording. So what exactly does he want: must the value 0x97DFEAC9 be inserted inside CalcCRC("THIS_IS_A_CRC"), as in your result, or after CalcCRC("THIS_IS_A_CRC"), as in mine?
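The two readings can be put side by side in a short Python 3 sketch. Here calc_crc is a hypothetical lookup built only from the two values shown in the question, and the second pattern is my own variant for the "inside the call" reading.

```python
import re

def calc_crc(name):
    # Hypothetical stand-in: the two values shown in the question's output.
    return {'THIS_IS_A_CRC': 0x97DFEAC9, 'PATIENT_ZERO': 0x0D691C21}[name]

line = 'printf( "0x%08X\\n", _CalcCRC("THIS_IS_A_CRC") );'

# Value placed after the call: keep the whole match, then append the CRC.
after_call = re.sub(
    r'_CalcCRC\(\s*"(\w+)"\s*\)',
    lambda m: '%s, 0x%08X' % (m.group(), calc_crc(m.group(1))),
    line)

# Value placed inside the call: re-emit the closing parenthesis after the CRC.
inside_call = re.sub(
    r'(_CalcCRC\(\s*"(\w+)")(\s*\))',
    lambda m: '%s, 0x%08X%s' % (m.group(1), calc_crc(m.group(2)), m.group(3)),
    line)

print(after_call)
print(inside_call)
```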

To be complete: like you, to get code that could be run I defined my own CalcCRC() function, consisting simply of if x=="THIS_IS_A_CRC": return 0x97DFEAC9 and if x=="PATIENT_ZERO": return 0x0D691C21; I took these pairs from the desired results shown in Joe's question.

Now, concerning your harsh assertion that my "point about redefinition of functions is utter nonsense": I think I didn't explain well enough what I meant. Passing the regex regx and the function ripl() as default arguments of ReplaceCRC() has a consequence: the objects regx and ripl() are bound as defaults only once, at the moment the definition of ReplaceCRC() is executed. So if ReplaceCRC() is called several times during a run, these objects are not re-created. I don't know whether ReplaceCRC() is really called several times in Joe's program, but I think it is good practice to build this in, in case it is useful. Maybe I should have stressed this point in my answer rather than in a comment justifying my code relative to yours, but I try to limit my tendency to write overly long answers.
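The "evaluated once, at definition time" behavior of default arguments can be checked with a small Python 3 sketch. The names make_regex and find_names are illustrative only, and the counter exists just to make the single evaluation visible.

```python
import re

calls = []

def make_regex():
    # Record each call so we can see how often the default value is built.
    calls.append('compiled')
    return re.compile(r'_CalcCRC\(\s*"(\w+)"\s*\)')

def find_names(text, regx=make_regex()):
    # The default above was evaluated once, when this def statement ran,
    # not on each call.
    return regx.findall(text)

find_names('_CalcCRC("A")')
find_names('_CalcCRC("B")')
print(len(calls))
```

A regex compiled once at module level and referenced inside the function is also created only once; the default-argument form mainly turns the global name lookup into a local one.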

Do these explanations clarify the points and soothe your annoyance?

solution1
1 2011-04-22 08:46:49

solution2
0 2011-04-22 09:18:23

solution3
0 2011-04-22 11:27:10
