python regex matching between multiple lines and every other match

I've been playing around with this for a few days. Here is what I am looking for and the regex I have now. I have a file in this format (there are some other fields, but I have omitted those):

I just want to match the bold text

ADDR 1 - XXXXXX   ADDR 1 - **XXXXXX**

ADDR 2 - XXXXXX   ADDR 2 - XXXXXX

ADDR 1 - XXXXXX   ADDR 1 - **XXXXXX**

ADDR 2 - XXXXXX   ADDR 2 - XXXXXX

The regex I have written only matches the first ADDR 1 - XXXXXX, but I need to match all instances of the bolded XXXXXX.

re.findall(r'ADDR 1- .*? ADDR 1-(.*?)(?=ADDR 2-)', lines, re.DOTALL)

Any suggestions? I feel like I might be missing something simple, but not sure.

Code:

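import re

# Named "text" rather than "str" to avoid shadowing the built-in type.
text = """
ADDR 1 - XXXXXX ADDR 1 - ABCDEF

ADDR 2 - XXXXXX ADDR 2 - XXXXXX

ADDR 1 - XXXXXX ADDR 1 - UVWXYZ

ADDR 2 - XXXXXX ADDR 2 - XXXXXX
"""

# The greedy .* runs to the last "ADDR 1 - " on each line; the group captures what follows it.
m = re.findall(r".*ADDR\s+1\s+-\s+(.*)", text)
print(m)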

Output:

C:\Users\dinesh_pundkar\Desktop>python c.py
['ABCDEF', 'UVWXYZ']

C:\Users\dinesh_pundkar\Desktop>

How it works:

.*ADDR\s+1\s+-\s+(.*)

Let's take a line - ADDR 1 - XXXXXX ADDR 1 - ABCDEF

  • .*ADDR will match ADDR 1 - XXXXXX ADDR . Since .* matches anything and regex quantifiers are greedy by nature, .* first consumes the whole line and then backtracks to the last ADDR that lets the rest of the pattern match. That is why ADDR is placed after .*
  • \s+1\s+-\s+(.*) will match the rest, 1 - ABCDEF . \s+1\s+-\s+ is required since we need to match ADDR 1 and not ADDR 2 . (.*) will match ABCDEF and store it.
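To make the role of the greedy .* concrete, here is a minimal sketch that runs the same pattern on a single line and compares it with a lazy variant. The lazy pattern and the variable names are illustrative assumptions, not part of the original answer.

```python
import re

line = "ADDR 1 - XXXXXX ADDR 1 - ABCDEF"

# Greedy .* consumes the whole line, then backtracks to the LAST "ADDR 1 - ".
greedy = re.search(r".*ADDR\s+1\s+-\s+(.*)", line)

# Lazy .*? (hypothetical variant) stops at the FIRST "ADDR 1 - " instead.
lazy = re.search(r".*?ADDR\s+1\s+-\s+(.*)", line)

print(greedy.group(1))  # ABCDEF
print(lazy.group(1))    # XXXXXX ADDR 1 - ABCDEF

# A line that only contains "ADDR 2" yields no match at all.
print(re.search(r".*ADDR\s+1\s+-\s+(.*)", "ADDR 2 - XXXXXX ADDR 2 - XXXXXX"))  # None
```

Because re.DOTALL is not set, . never crosses a newline, so re.findall applies this logic to each line independently.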

If you want to capture every other instance of something, splitting or slicing the string can be much faster than using a regex. The following demonstrates a very basic example, where s is the input text:

split() method:

>>> [i.split('ADDR 1 - ')[-1] for i in s.split('\n')[::2]]
['AXXXXZ', 'AXXXXY']
# 18.3057999611 seconds - 10000000 iterations

findall() method:

>>> re.findall(r".*ADDR\s+1\s+-\s+(.*)", s)
['AXXXXZ', 'AXXXXY']
# 77.5003650188 seconds - 10000000 iterations

In situations where you know a regex isn't absolutely necessary, consider using an alternative. Also, the regex shown in the accepted answer could be optimized to cut the time nearly in half, e.g. re.findall("ADDR 1 .+ - (.+)", s): 37.0185003658 seconds - 10000000 iterations.
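The timings above can be reproduced with something like the sketch below. The sample string s, the function names, and the iteration count are assumptions made for illustration; the original benchmark setup is not shown, and absolute numbers will vary by machine and Python version.

```python
import re
import timeit

# Assumed input: one record per line, no blank lines, so [::2] picks the "ADDR 1" lines.
s = (
    "ADDR 1 - XXXXXX ADDR 1 - AXXXXZ\n"
    "ADDR 2 - XXXXXX ADDR 2 - XXXXXX\n"
    "ADDR 1 - XXXXXX ADDR 1 - AXXXXY\n"
    "ADDR 2 - XXXXXX ADDR 2 - XXXXXX"
)

def with_split():
    # Take every other line and keep the text after the last "ADDR 1 - ".
    return [line.split("ADDR 1 - ")[-1] for line in s.split("\n")[::2]]

def with_findall():
    # Pattern from the accepted answer.
    return re.findall(r".*ADDR\s+1\s+-\s+(.*)", s)

def with_simpler_findall():
    # Simplified pattern suggested above.
    return re.findall(r"ADDR 1 .+ - (.+)", s)

# All three approaches should agree before their speed is compared.
assert with_split() == with_findall() == with_simpler_findall()

for fn in (with_split, with_findall, with_simpler_findall):
    seconds = timeit.timeit(fn, number=10_000)
    print(f"{fn.__name__}: {seconds:.4f} seconds - 10000 iterations")
```

Note that the split() version depends on the lines strictly alternating; if the file has blank lines or extra fields between records, the [::2] slice would need adjusting, whereas the regex versions would not.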
