
regex not matching

I am writing a small Python script to gather some data from a database. The only problem is that when I export data as XML from MySQL, the XML file includes a \b character. I wrote code to remove it, but then realized I didn't need to do that processing every time, so I put it in a method that I call only if I find a \b in the XML file. Now the regex isn't matching, even though I know the \b is there.

Here is what I am doing:

Main program:

'''Program should start here'''
#test the file to see if processing is needed before parsing
for line in xml_file:
    p = re.compile("\b")
    if(p.match(line)):
        print p.match(line)
        processing = True
        break #only one match needed

if(processing):
    print "preprocess"
    preprocess(xml_file)

Preprocessing method:

def preprocess(file):
    #exporting from MySQL query browser adds a weird
    #character to the result set, remove it
    #so the XML parser can read the data
    print "in preprocess"
    lines = []
    for line in xml_file:
        lines.append(re.sub("\b", "", line))

    #go to the beginning of the file
    xml_file.seek(0);
    #overwrite with correct data
    for line in lines:
        xml_file.write(line);
    xml_file.truncate()

Any help would be great. Thanks.

\b is a special sequence for the regular expression engine, not an ordinary character. From the Python re documentation:

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. Note that \b is defined as the boundary between \w and \W, so the precise set of characters deemed to be alphanumeric depends on the values of the UNICODE and LOCALE flags. Inside a character range, \b represents the backspace character, for compatibility with Python's string literals.

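So to find the backspace character with a regex, the engine must not see a bare \b: escape it, or put it inside a character class ([\b]) as the last sentence of the quote describes.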
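A small sketch of the distinction in the quote above, written for Python 3 (the thread itself uses Python 2 print statements). The sample string `text` is made up for illustration. The patterns are raw strings so that the regex engine, not the Python string parser, interprets the backslash.

```python
import re

# "\x08" is the backspace character, the same character "\b" denotes
# in a plain (non-raw) Python string literal.
text = "foo\x08bar"
assert "\b" == "\x08"

# Raw string: the engine sees backslash + b, the word-boundary assertion.
boundary = re.compile(r"\b")
# Inside a character class, \b means the backspace character.
backspace_class = re.compile(r"[\b]")
# A hex escape names the backspace character unambiguously.
backspace_hex = re.compile(r"\x08")

print(boundary.search(text).span())         # (0, 0): zero-width match at the start of "foo"
print(backspace_class.search(text).span())  # (3, 4): the backspace itself
print(backspace_hex.search(text).span())    # (3, 4)
```

The word-boundary pattern never consumes the backspace. It only reports a zero-width position, which is why substituting it with an empty string leaves the character in place.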

Escape it with a backslash in the regex. Since a backslash in a Python string literal needs to be escaped as well (unless you use a raw string, which you don't want here), you need a total of three backslashes:

p = re.compile("\\\b")

This will produce a pattern that matches the backspace (\b) character.
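To see what that pattern contains, here is a hedged Python 3 sketch. The sample `line` is hypothetical. It also checks a point that may matter for the code in the question: re.match only tries the pattern at the start of the string, while re.search scans the whole line.

```python
import re

# "\\\b" is two characters: a backslash followed by a backspace.
pattern_text = "\\\b"
assert len(pattern_text) == 2

escaped = re.compile(pattern_text)
# "\b" in a plain string literal is already the single backspace character.
plain = re.compile("\b")

# Hypothetical XML line with a stray backspace in the middle.
line = "<row>bad\bvalue</row>"

found_by_match = escaped.match(line)    # None: the line does not start with a backspace
found_by_search = escaped.search(line)  # match object covering the backspace
print(found_by_match, found_by_search.span(), plain.search(line).span())
```

Both patterns find the backspace with search, so if the character sits anywhere other than the first position of a line, switching the check from match to search is probably the change that makes it succeed.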

Correct me if I'm wrong, but there is no need to use a regex to replace '\b'; you can simply use the string replace method for this purpose:

def preprocess(xml_file):
    #exporting from MySQL query browser adds a weird
    #character to the result set, remove it
    #so the XML parser can read the data
    print "in preprocess"
    #rewind first: the caller may already have read part of the file
    xml_file.seek(0)
    lines = [line.replace("\b", "") for line in xml_file]
    #go to the beginning of the file
    xml_file.seek(0)
    #overwrite with correct data
    for line in lines:
        xml_file.write(line)
    # OR: xml_file.writelines(lines)
    xml_file.truncate()

Note that in Python there is no need to put ';' at the end of a statement.
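For anyone who wants to try the replace-based approach without touching a real export, here is a minimal Python 3 sketch that uses an in-memory file. The helper names `needs_preprocessing` and `strip_backspaces` and the sample data are assumptions for illustration, not part of the original script.

```python
import io

def needs_preprocessing(xml_file):
    """Return True if any line contains a backspace (plain substring test, no regex)."""
    xml_file.seek(0)
    return any("\b" in line for line in xml_file)

def strip_backspaces(xml_file):
    """Rewrite a seekable read/write text file in place without backspace characters."""
    xml_file.seek(0)
    cleaned = [line.replace("\b", "") for line in xml_file]
    xml_file.seek(0)
    xml_file.writelines(cleaned)
    # The cleaned data is shorter, so drop the leftover tail.
    xml_file.truncate()

# In-memory stand-in for the exported XML file.
demo = io.StringIO("<a>1\b</a>\n<b>\b2</b>\n")
if needs_preprocessing(demo):
    strip_backspaces(demo)
result = demo.getvalue()
print(repr(result))
```

With a real file, the same functions should work on a handle opened in "r+" mode, since both reading and in-place writing are needed.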
