
Python regex to get n characters before and after a keyword in a line of text

I'm trying to parse through a file and search each line for any keyword from a list of strings. I need to return the 'n' characters before and after each occurrence. I have it working without regex, but it's not very efficient. Any idea how to do the same with regex and findall? lookup is a list of strings. This is what I have without regex:

with open(file, 'r') as temp:
    for num, line in enumerate(temp, 1):
        for string in lookup:
            if string in line:

                # Split the line in 2 substrings
                tmp1 = line.split(string)[0]
                tmp2 = line.split(string)[1]

                # Truncate only 'n' characters before and after the keyword
                tmp = tmp1[-n:] + string + tmp2[:n]

                # Do something here...

This is the start with regex:

with open(file, 'r') as temp:
    for num, line in enumerate(temp, 1):
        for string in lookup:
            # Regex search with Ignorecase
            searchObj = re.findall(string, line, re.M | re.I)

            if searchObj:
                print "search --> : ", searchObj

                # Loop through searchObj and get n characters

From https://docs.python.org/2/library/re.html

start([group])
end([group])
   Return the indices of the start and end of the substring matched by 
   group; group defaults to zero (meaning the whole matched substring). 
   Return -1 if group exists but did not contribute to the match. For a 
   match object m, and a group g that did contribute to the match, the 
   substring matched by group g (equivalent to m.group(g)) is


    m.string[m.start(g):m.end(g)]

    Note that m.start(group) will equal m.end(group) if group matched a 
    null string. For example, after m = re.search('b(c?)', 'cba'), 
    m.start(0) is 1, m.end(0) is 2, m.start(1) and m.end(1) are both    
    2, and m.start(2) raises an IndexError exception.

Using re.finditer you get an iterator of match objects (MatchObject), one per occurrence, and you can then call start() and end() on each of them to find where the keyword begins and ends in the line.
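As a rough sketch of that suggestion, each match's start() and end() can be used to slice the line. The helper name context_windows and the sample sentence are made up for illustration and are not from the question. Two assumptions are baked in: the lookup entries are literal keywords (hence re.escape), and the window should simply be shorter when the keyword sits near the start of the line (hence the clamp at 0, since a negative slice index would count from the end of the string).

```python
import re

def context_windows(line, keyword, n):
    """Return each occurrence of keyword with up to n characters on either side."""
    windows = []
    # re.escape treats the keyword as literal text (an assumption; drop it
    # if the lookup entries are meant to be regex patterns).
    for match in re.finditer(re.escape(keyword), line, re.I):
        # Clamp at 0: a negative start index would wrap around to the end.
        start = max(match.start() - n, 0)
        end = match.end() + n  # slicing past the end of a string is safe
        windows.append(line[start:end])
    return windows

print(context_windows("The cat sat on the Cat mat", "cat", 3))
# ['he cat sa', 'he Cat ma']
```

Unlike findall, this reports every occurrence separately, even when two occurrences are closer together than n characters.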

I got it to work. Below is the code if anyone needs it:

import re

with open(file, 'r') as temp:
    for num, line in enumerate(temp, 1):
        for string in lookup:

            # Regex: finditer yields one match object per occurrence
            # (it yields nothing if there is no match, so no 'if' is needed)
            for match in re.finditer(string, line, re.M | re.I):

                # Start and end indices of the keyword
                start, end = match.span()

                # Keep only 'n' characters before and after the keyword.
                # Clamp at 0: a negative start index would count from the end of the line.
                tmp = line[max(start - n, 0):end + n] + '\n'
                print(tmp)
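Since the question asked about findall specifically, here is a minimal sketch of an alternative that puts the context into the pattern itself, using bounded quantifiers. The function name context_findall is hypothetical, and re.escape again assumes the lookup entries are literal keywords rather than regex patterns.

```python
import re

def context_findall(line, keyword, n):
    """Return keyword plus up to n characters of context on each side, via one findall."""
    # .{0,n} greedily takes up to n characters (never a newline) before and after.
    pattern = r".{0,%d}%s.{0,%d}" % (n, re.escape(keyword), n)
    return re.findall(pattern, line, re.I)

print(context_findall("The cat sat on the Cat mat", "cat", 3))
# ['he cat sa', 'he Cat ma']
```

One trade-off to keep in mind: findall returns non-overlapping matches, so when two occurrences are within n characters of each other, the second one is consumed as context of the first and is not reported on its own. For example, context_findall("cat cat", "cat", 3) returns a single window, 'cat ca'.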


 