
Extracting substrings from a file using Python regex

A file has n lines, organized in blocks of logically defined strings. I'm parsing each line and capturing the required data based on some matching conditions.

I read through each line and find the blocks with this code:

#python
    import re

    for line in file:  # file is an already-open file object
        match = re.match(r'block.+', line)  # match once and reuse the result
        if match is not None:
            block_name = match.group(0)
            # string matching code to be added here

Input File:

line1    select KT_TT=$TMTL/$SYSNAME.P1
line2    . $dhe/ISFUNC sprfl tm/tm1032 int 231
line3    select IT_TT=$TMTL/$SYSNAME.P2
line4    . $DHE/ISFUNC ptoic ca/ca256 tli 551
         .....
         .....


line89   CALLING IK02=$TMTL/$SYSNAME.P2
line90   CALLING KK01=$TMTL/$SYSNAME.P1

Matching conditions & expected output of each step:

  1. While reading the lines, match the word "/ISFUNC" and capture the characters from the end of the line back to the last "/", then save them to a variable. Expected output: tm1032 int 231, ca256 tli 551 (matching strings found in line 2, line 4, etc.)
  2. Once ISFUNC is found, read the immediately preceding line and capture the data from the end of that line back to the last "/", then save it to a variable. Expected output: $SYSNAME.P1 and $SYSNAME.P2 (line 1, line 3, etc.)
  3. Continue reading down the lines and look for a line starting with "CALLING" whose last string after "/" matches the output of step 2 ($SYSNAME.P1 and $SYSNAME.P2). Capture only the data after the word CALLING and save it. Expected output: KK01 (line 90) and IK02 (line 89)

The final output should look like this:

FUNC             SYS            CALL
tm1032 int 231   $SYSNAME.P1    KK01
ca256 tli 551    $SYSNAME.P2    IK02 

If all you need is the text after the last slash, you do not need a regex at all.

Simply call .split("/") on each line and take the last part, which is the text after the final slash:

sample = "$dhe/ISFUNC sprfl tm/tm1032 int 231"
sample.split("/")

will result in

['$dhe', 'ISFUNC sprfl tm', 'tm1032 int 231']

Then access the last element of the list with index -1 to get the value.

PS: Call split only once you have found the corresponding line.
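As a minimal sketch of the split approach, assuming the line has already been identified as the one you want. The helper name last_segment is hypothetical and only used for illustration. It uses rsplit with maxsplit=1 so that only the final slash is considered, and strips the trailing newline that lines read from a file usually carry:

```python
def last_segment(line: str) -> str:
    """Return the text after the final "/" in line, without surrounding whitespace."""
    # rsplit with maxsplit=1 splits only at the last slash;
    # if there is no slash, the whole line is returned.
    return line.rsplit("/", 1)[-1].strip()


sample = "$dhe/ISFUNC sprfl tm/tm1032 int 231"
print(last_segment(sample))  # tm1032 int 231
```

The same helper applies to the preceding "select" line, where it would return the part after $TMTL/.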

While reading the lines, match the word "/ISFUNC" and capture the characters from the end of the line back to the last "/", then save them to a variable. Expected output: tm1032 int 231 (matching string found in line 2)

char_list = re.findall(r'/ISFUNC.*/(.*)$', line)
if char_list:
    chars = char_list[0]

Once ISFUNC is found, read the immediately preceding line and capture the data from the end of that line back to the last "/", then save it to a variable. Expected output: $SYSNAME.P1 (line 1)

A workable approach here is to either (a) read the lines into a list once and iterate over the indices rather than the lines themselves (i.e. lines = file.readlines(), then for i in range(len(lines)): ... lines[i - 1]), or (b) keep a copy of the previous line (say, put last_line = line at the end of your for loop). Do not call file.readlines() repeatedly: after the first call the file is exhausted and later calls return an empty list. Then reference the previous line in this expression:

# [^/\n] keeps the trailing newline of last_line out of the captured text
data_list = re.findall(r'/([^/\n]*)$', last_line)
if data_list:
    data = data_list[0]
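The "remember the previous line" option (b) can be sketched as follows. This is an illustration under assumptions: the input is given as an in-memory list named lines rather than a file, and the "line1", "line2" labels from the question are treated as annotations that are not part of the file. The names pairs, func and sys_match are hypothetical.

```python
import re

# Sample input taken from the question, without the "lineN" labels.
lines = [
    "select KT_TT=$TMTL/$SYSNAME.P1",
    ". $dhe/ISFUNC sprfl tm/tm1032 int 231",
    "select IT_TT=$TMTL/$SYSNAME.P2",
    ". $DHE/ISFUNC ptoic ca/ca256 tli 551",
]

pairs = []        # collected (FUNC, SYS) tuples
last_line = None  # the line seen in the previous iteration

for line in lines:
    # Step 1: text after the last "/" on a line containing "/ISFUNC".
    func = re.search(r'/ISFUNC.*/(.*)$', line)
    if func and last_line is not None:
        # Step 2: text after the last "/" on the preceding line.
        sys_match = re.search(r'/([^/\n]*)$', last_line)
        if sys_match:
            pairs.append((func.group(1), sys_match.group(1)))
    last_line = line  # remember this line for the next iteration

print(pairs)
```

Keeping only one previous line avoids loading the whole file when it is iterated directly, which is the main reason to prefer option (b) over indexing into a list.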

Continue reading down the lines and look for a line starting with "CALLING" whose last string after "/" matches the output of step 2 ($SYSNAME.P1). Capture only the data after the word CALLING and save it. Expected output: KK01 (line 90)

Assuming, from your example, that you mean just the data immediately after CALLING (i.e. up until the equals sign):

# \s* consumes the whitespace after CALLING so it is not part of the captured text
calling_list = re.findall(r'CALLING\s*(.*)=.*/' + re.escape(data) + r'$', line)
if calling_list:
    calling = calling_list[0]

You can move the parentheses around to change exactly which part of the line is captured. When the pattern contains a single group, re.findall() returns a list containing only the text that matched inside the parentheses.
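Putting the three steps together, one possible sketch is shown below. It is not a definitive implementation: the function name parse_blocks and the dictionary layout are hypothetical, the "lineN" labels from the question are assumed not to be part of the file, and each SYS value is assumed to appear with only one ISFUNC line. CALLING lines are also assumed to come after the ISFUNC lines they refer to, as in the question's sample.

```python
import re


def parse_blocks(lines):
    """Return a list of (FUNC, SYS, CALL) tuples from an iterable of lines."""
    rows = {}       # SYS value -> {"func": ..., "call": ...}
    last_line = ""  # previous line, used for step 2

    for raw in lines:
        line = raw.rstrip("\n")

        # Steps 1 and 2: an ISFUNC line plus the line right before it.
        func = re.search(r'/ISFUNC.*/(.*)$', line)
        if func:
            sys_match = re.search(r'/([^/]*)$', last_line)
            if sys_match:
                rows[sys_match.group(1)] = {"func": func.group(1), "call": None}

        # Step 3: a CALLING line whose last segment is a known SYS value.
        call = re.search(r'CALLING\s+([^=\s]+)=.*/([^/]*)$', line)
        if call and call.group(2) in rows:
            rows[call.group(2)]["call"] = call.group(1)

        last_line = line

    return [(r["func"], sys_name, r["call"]) for sys_name, r in rows.items()]


sample = [
    "select KT_TT=$TMTL/$SYSNAME.P1\n",
    ". $dhe/ISFUNC sprfl tm/tm1032 int 231\n",
    "select IT_TT=$TMTL/$SYSNAME.P2\n",
    ". $DHE/ISFUNC ptoic ca/ca256 tli 551\n",
    "CALLING IK02=$TMTL/$SYSNAME.P2\n",
    "CALLING KK01=$TMTL/$SYSNAME.P1\n",
]

print(f"{'FUNC':<17}{'SYS':<15}CALL")
for func, sys_name, call in parse_blocks(sample):
    print(f"{func:<17}{sys_name:<15}{call}")
```

With a real file, the same function could be called as parse_blocks(open(path)), since it only needs an iterable of lines. Rows are keyed by the SYS value so that a later CALLING line can be matched back to the ISFUNC line it belongs to.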
