Using regex in Python to extract specific strings from a text file

I have a text file containing the error report from a VHDL code compilation. I want to automate a few things, which requires pulling some data out of this file. I am specifically looking for the string that follows "[Hierarchy: 'block_ram_inst." (in my case block_ram_top or block_ram_top_1) and the file path on the same line. I also want to extract the port name from that line.

InOut-Report                         Error     /drive/build/users/tempuser/MCpro/projectphase1/MCpro_phase_1/project227/test_prj/mem/pro_1/pro_1.srcs/sources/code/mem_gen/vhdlcode/ramblock.vhd 241  10   Port 'clk' is not registered [Hierarchy: 'block_ram_inst.block_ram_top']
InOut-Report                         Error     /drive/build/users/tempuser/MCpro/projectphase1/MCpro_phase_1/project227/test_prj/mem/pro_1/pro_1.srcs/sources/code/mem_gen/vhdlcode/ramblock.vhd 113  10   Port 'dina[31:0]' is not registered [Hierarchy: 'block_ram_inst.block_ram_top_1']
InOut-Report                         Error     /drive/build/users/tempuser/MCpro/projectphase1/MCpro_phase_1/project227/test_prj/mem/pro_1/pro_1.srcs/sources/code/mem_gen/vhdlcode/ramblock.vhd 325  10   Port 'clk' is not registered [Hierarchy: 'block_ram_inst.block_ram_top']
InOut-Report                         Error     /drive/build/users/tempuser/MCpro/projectphase1/MCpro_phase_1/project227/test_prj/mem/pro_1/pro_1.srcs/sources/code/mem_gen/vhdlcode/ramblock.vhd 152  10   Port 'clk' is not registered [Hierarchy: 'block_ram_inst.block_ram_top_1']
InOut-Report                         Error     /drive/build/users/tempuser/MCpro/projectphase1/MCpro_phase_1/project227/test_prj/mem/pro_1/pro_1.srcs/sources/code/mem_gen/vhdlcode/ramblock.vhd 318  10   Port 'wea[0]' is not registered [Hierarchy: 'block_ram_inst.block_ram_top']
InOut-Report                         Error     /drive/build/users/tempuser/MCpro/projectphase1/MCpro_phase_1/project227/test_prj/mem/pro_1/pro_1.srcs/sources/code/mem_gen/vhdlcode/ramblock.vhd 289  10   Port 'clk' is not registered [Hierarchy: 'block_ram_inst.block_ram_top_1']

I have written code to extract the string that follows the hierarchy prefix and the file name. However, I am not able to extract the full file path or the port name.

Here is my code.

import re

with open(filename, 'r') as f:
    # Keep only the lines that belong to the InOut report
    targets = [line for line in f if "InOut-Report" in line]
    filenames = []
    data = []
    for line in targets:
        match = re.match(r"InOut-Report.*/([-A-Za-z0-9_://.]+).*\[Hierarchy: 'block_ram_inst\.(\w+)']", line)
        if match:
            filenames.append(match.group(1))  # only the text after the last '/'
            data.append(match.group(2))       # hierarchy name
print(filenames)
print(data)

The output I get is:

['ramblock.vhd', 'ramblock.vhd', 'ramblock.vhd', 'ramblock.vhd', 'ramblock.vhd', 'ramblock.vhd', 'ramblock.vhd', 'ramblock.vhd']
['block_ram_top', 'block_ram_top_1', 'block_ram_top', 'block_ram_top_1', 'block_ram_top', 'block_ram_top_1', 'block_ram_top', 'block_ram_top_1']

But I want the full path in my output, not just the file name. I also want to extract the port name from each line into a separate list.

Treating the contents of the file as a string, you can match the regular expression

r"(?m) ((?:\/[\w.]+)+) .* 'block_ram_inst\.([\w.]+)'\]$"

For each match, the path is held in capture group 1 and the hierarchy name (block_ram_top or block_ram_top_1) in capture group 2. This pattern does not capture the port name.
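
As a minimal sketch of how this pattern might be applied, the snippet below runs it over a two-line sample with re.findall. The sample uses a shortened path for readability, and the names report_text and PATTERN are illustrative, not from the original post.

```python
import re

# Two sample lines in the report's format (path shortened for readability).
report_text = (
    "InOut-Report      Error     /drive/build/vhdlcode/ramblock.vhd 241  10   "
    "Port 'clk' is not registered [Hierarchy: 'block_ram_inst.block_ram_top']\n"
    "InOut-Report      Error     /drive/build/vhdlcode/ramblock.vhd 113  10   "
    "Port 'dina[31:0]' is not registered [Hierarchy: 'block_ram_inst.block_ram_top_1']\n"
)

# Double quotes around the raw string, because the pattern itself contains '.
PATTERN = re.compile(r"(?m) ((?:\/[\w.]+)+) .* 'block_ram_inst\.([\w.]+)'\]$")

# findall returns one (group 1, group 2) tuple per matching line.
matches = PATTERN.findall(report_text)
paths = [m[0] for m in matches]
hierarchy = [m[1] for m in matches]

print(paths)
print(hierarchy)
```

With a real report, report_text would come from reading the whole file into a string, for example with open(...).read().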

Python's regex engine performs the following operations.

  (?m)               : set the multiline flag so '$' matches at the end
                       of each line
  [ ]                : match a space
  (                  : begin capture group 1
    (?:\/[\w.]+)     : match '/' then 1+ word chars or periods in a
                       non-capture group
    +                : execute non-capture group 1+ times
  )                  : end capture group 1
  [ ].*[ ]           : match a space, then 0+ chars, then a space
  'block_ram_inst\.  : match "'block_ram_inst."
  ([\w.]+)           : match 1+ word characters or periods in capture
                       group 2
  '\]                : match "']"
  $                  : match end of line

I've written each space above as a character class containing a space ( [ ] ) merely so it can be seen.
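
The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE. This is only a sketch: the sample line uses a shortened path, and the (?m) prefix is replaced here by the equivalent re.MULTILINE flag.

```python
import re

# In verbose mode bare whitespace in the pattern is ignored,
# so each literal space is written as [ ].
PATTERN = re.compile(
    r"""
    [ ]                  # a space before the path
    ((?:/[\w.]+)+)       # group 1: one or more /name components
    [ ].*[ ]             # a space, any characters, then a space
    'block_ram_inst\.    # the literal prefix before the hierarchy name
    ([\w.]+)             # group 2: word characters or periods
    '\]                  # the closing quote and bracket
    $                    # end of line (re.MULTILINE)
    """,
    re.VERBOSE | re.MULTILINE,
)

line = (
    "InOut-Report   Error   /drive/build/vhdlcode/ramblock.vhd 241  10   "
    "Port 'clk' is not registered [Hierarchy: 'block_ram_inst.block_ram_top']"
)

m = PATTERN.search(line)
if m:
    print(m.group(1))  # the path
    print(m.group(2))  # the hierarchy name
```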

The regex you posted doesn't seem to be working, but as your question is clear, I tried the following regex:

r"InOut-Report\s+Error\s+([-\w://.]+)\s+\d+\s+\d+\s+Port '(\w+)' is not registered \[Hierarchy: 'block_ram_inst\.(block_ram_top(?:_\d)?)'\]"

  • InOut-Report : matches that string literally, nothing fancy.
  • \s+ : \s matches any whitespace character (spaces, tabs, new lines, ...) and the + modifier specifies that there can be 1 or more.
  • ([-\w://.]+) : \w means any alphanumeric character or underscore ( A-Za-z0-9_ ). The class also includes dashes ( - ), colons ( : ), slashes ( / ) and dots ( . ). The second slash is redundant (it was copied from your question), so one of them can be dropped. The square brackets with the modifier [...]+ mean that one or more of the listed characters must appear. The surrounding parentheses (...) make it a capture group: the path.
  • \s+ : see above.
  • \d+ : similar to above, but \d means any digit ( 0-9 ), so this is 1 or more digits.
  • \s+ : see above.
  • \d+ : see above.
  • \s+ : see above.
  • Port ' : matches that string literally, nothing fancy.
  • (\w+) : second capture group of 1 or more alphanumeric characters or underscores, the port name. Note that \w does not match [ , ] or : , so a line whose port is written like 'dina[31:0]' or 'wea[0]' is not matched; a class such as [^']+ would accept those names too.
  • ' is not registered \[Hierarchy: 'block_ram_inst\. : matches that string literally. We escape [ and . because they have special meanings and we want the literal characters.
  • (block_ram_top(?:_\d)?) : block_ram_top is the literal string, nothing fancy. _\d means an underscore followed by a digit. They are inside a non-capture group (?:...), which groups parts of the pattern without saving them for later, unlike the path or the port name. Here the grouping exists so that the ? modifier that follows applies to both, meaning 0 or 1 times. So (?:_\d)? matches an underscore followed by a digit, or nothing. All of it is surrounded by a capture group (...), which saves it as the third value.
  • \] : the literal character ] , escaped again to specify that we want the literal character and not its special meaning.
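
To see how the pieces above behave together, here is a small sketch that applies the full pattern to two sample lines (paths shortened for readability; the names PATTERN and lines are illustrative). It also shows a limitation of the (\w+) port group: a port written with brackets, such as dina[31:0], does not match.

```python
import re

PATTERN = re.compile(
    r"InOut-Report\s+Error\s+([-\w://.]+)\s+\d+\s+\d+\s+Port '(\w+)' "
    r"is not registered \[Hierarchy: 'block_ram_inst\.(block_ram_top(?:_\d)?)'\]"
)

lines = [
    "InOut-Report   Error   /drive/build/vhdlcode/ramblock.vhd 241  10   "
    "Port 'clk' is not registered [Hierarchy: 'block_ram_inst.block_ram_top']",
    "InOut-Report   Error   /drive/build/vhdlcode/ramblock.vhd 113  10   "
    "Port 'dina[31:0]' is not registered [Hierarchy: 'block_ram_inst.block_ram_top_1']",
]

for line in lines:
    m = PATTERN.match(line)
    # groups() is (path, port, hierarchy) when the line matches
    print(m.groups() if m else "no match")
```

The first line yields the three groups; the second prints "no match" because \w+ stops at the opening bracket of dina[31:0].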

You could also simplify it a bit more:

r"([-\w://.]+)[\s\d]+Port '(\w+)' is not registered \[Hierarchy: 'block_ram_inst\.(block_ram_top(?:_\d)?)'\]"

The simplification drops the part before the path and replaces \s+\d+\s+\d+\s+ with [\s\d]+, which treats the run of whitespace and digits as a whole, since we don't need to capture the digits.

In the comments, another modification was requested:

r"([-\w://.]+)[\s\d]+Port '(\w+)' is not registered \[Hierarchy: 'block_ram_inst\.(\w+)'\]"

The only modification is in the third capture group, which previously accepted only block_ram_top or block_ram_top_ followed by a single digit and now accepts any run of alphanumeric characters and underscores.
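
Putting it together, the following sketch collects the three lists the question asks for. It is a variant, not the pattern as posted: the port group is widened from \w+ to [^']+ (an assumption of this sketch) so that bracketed ports also match, and the redundant slash is dropped from the path class. The function name parse_report and the shortened sample paths are hypothetical.

```python
import re

# Variant of the last pattern: port group widened from \w+ to [^']+
# so that ports such as dina[31:0] or wea[0] are captured as well.
PATTERN = re.compile(
    r"([-\w:/.]+)[\s\d]+Port '([^']+)' is not registered "
    r"\[Hierarchy: 'block_ram_inst\.(\w+)'\]"
)

def parse_report(lines):
    """Return (paths, ports, hierarchies) for every matching report line."""
    paths, ports, hierarchies = [], [], []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            path, port, hier = m.groups()
            paths.append(path)
            ports.append(port)
            hierarchies.append(hier)
    return paths, ports, hierarchies

sample = [
    "InOut-Report   Error   /drive/build/vhdlcode/ramblock.vhd 241  10   "
    "Port 'clk' is not registered [Hierarchy: 'block_ram_inst.block_ram_top']",
    "InOut-Report   Error   /drive/build/vhdlcode/ramblock.vhd 113  10   "
    "Port 'dina[31:0]' is not registered [Hierarchy: 'block_ram_inst.block_ram_top_1']",
    "InOut-Report   Error   /drive/build/vhdlcode/ramblock.vhd 318  10   "
    "Port 'wea[0]' is not registered [Hierarchy: 'block_ram_inst.block_ram_top']",
]

paths, ports, hierarchies = parse_report(sample)
print(paths)
print(ports)
print(hierarchies)
```

Because an open file object iterates line by line, parse_report could be called directly on the file opened in the question's code.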
