I've been playing around with this for a few days. Here is what I am looking for and the regex I have so far. I have a file in this format (there are some other fields, but I have omitted those).
I just want to match the bold text:
ADDR 1 - XXXXXX ADDR 1 - **XXXXXX**
ADDR 2 - XXXXXX ADDR 2 - XXXXXX
ADDR 1 - XXXXXX ADDR 1 - **XXXXXX**
ADDR 2 - XXXXXX ADDR 2 - XXXXXX
The regex I have written only matches the first ADDR 1 - XXXXXX, but I need to match every instance of the bolded XXXXXX.
re.findall(r'ADDR 1- .*? ADDR 1-(.*?)(?=ADDR 2-)', lines, re.DOTALL)
Any suggestions? I feel like I might be missing something simple, but not sure.
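Code:
import re

# Sample input. It is named text rather than str so the built-in type is not shadowed.
text = """
ADDR 1 - XXXXXX ADDR 1 - ABCDEF
ADDR 2 - XXXXXX ADDR 2 - XXXXXX
ADDR 1 - XXXXXX ADDR 1 - UVWXYZ
ADDR 2 - XXXXXX ADDR 2 - XXXXXX
"""
# The greedy .* runs to the last "ADDR" on each line, so the group captures the second field.
m = re.findall(r".*ADDR\s+1\s+-\s+(.*)", text)
print(m)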
Output:
C:\Users\dinesh_pundkar\Desktop>python c.py
['ABCDEF', 'UVWXYZ']
C:\Users\dinesh_pundkar\Desktop>
How it works:
The pattern is .*ADDR\s+1\s+-\s+(.*)
Let's take one line as an example: ADDR 1 - XXXXXX ADDR 1 - ABCDEF
.*ADDR matches ADDR 1 - XXXXXX ADDR. The .* is greedy, so it consumes as much of the line as it can and then backtracks to the last ADDR that still lets the rest of the pattern match. That is how the first field gets skipped.
\s+1\s+-\s+ matches the " 1 - " that follows. It is required because we need to match ADDR 1 and not ADDR 2.
(.*) matches ABCDEF and captures it.
If you want to capture every other instance of something, splitting or slicing the string is going to be much faster than using a regex. The following demonstrates a very basic example:
split() method:
>>> [i.split('ADDR 1 - ')[-1] for i in s.split('\n')[::2]]
['AXXXXZ', 'AXXXXY']
18.3057999611 seconds for 10000000 iterations
findall() method:
>>> re.findall(r".*ADDR\s+1\s+-\s+(.*)", s)
['AXXXXZ', 'AXXXXY']
77.5003650188 seconds for 10000000 iterations
In situations where you know a regex isn't absolutely necessary, consider using an alternative. Also, the regex shown in the accepted answer could be optimized to cut the time roughly in half: re.findall("ADDR 1 .+ - (.+)", s) took 37.0185003658 seconds for 10000000 iterations.
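For anyone who wants to reproduce the comparison, here is a minimal sketch that runs the three approaches against the same input and times them with timeit. The sample string, the function names, and the iteration count are my own assumptions for illustration. The split version also assumes the ADDR 1 lines sit at even positions, with no leading blank line.

```python
import re
import timeit

# Sample data modelled on the question; the field values are placeholders.
s = (
    "ADDR 1 - XXXXXX ADDR 1 - ABCDEF\n"
    "ADDR 2 - XXXXXX ADDR 2 - XXXXXX\n"
    "ADDR 1 - XXXXXX ADDR 1 - UVWXYZ\n"
    "ADDR 2 - XXXXXX ADDR 2 - XXXXXX"
)


def by_split(text):
    # Take every other line (the ADDR 1 lines) and keep what follows the last marker.
    return [line.split("ADDR 1 - ")[-1] for line in text.split("\n")[::2]]


def by_greedy_regex(text):
    # Pattern from the first answer: a leading greedy .* skips to the last ADDR 1.
    return re.findall(r".*ADDR\s+1\s+-\s+(.*)", text)


def by_shorter_regex(text):
    # Shorter pattern with literal spaces and no leading .*
    return re.findall(r"ADDR 1 .+ - (.+)", text)


if __name__ == "__main__":
    for func in (by_split, by_greedy_regex, by_shorter_regex):
        # Bind func as a default argument so each timing uses the intended function.
        seconds = timeit.timeit(lambda f=func: f(s), number=10000)
        print(func.__name__, func(s), "%.4f seconds" % seconds)
```

Absolute timings depend on the machine and Python version, so expect different numbers from the ones quoted above; what matters is that all three return the same list and how they rank relative to each other.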