Python Regex match all occurrences of decimal pattern followed by another pattern

I've done a lot of searching, including this SO post, which almost worked for me.

I'm working with a huge string, trying to capture the groups of four digits that appear after a series of decimal patterns AND before an alphanumeric word.

There are other four-digit groups that don't qualify because they are preceded by words or other number patterns.

EDIT: My string is not multiline; it is only shown that way here for readability.

For example:

>> my_string = """BEAVER COUNTY 001 0000 
1010 BEAVER 
2010 BEAVER COUNTY SCH DIST 
0.008504 
...(more decimals)
0.008508 
4010 COUNTY SPECIAL SERVICE DIST NO.1   <---capture this 4010
4040 BEAVER COUNTY 
8005 GREENVILLE SOLAR
0.004258 
0.008348 
...(more decimals)
0.008238 
4060 SPECIAL SERVICE DISTRICT NO 7   <---capture this 4060
"""

The ideal re.findall should return:

['4010','4060']

Here are the patterns I've tried that fall short:

re.findall(r'(?=(\d\.\d{6}\s+)(\s+\d{4}\s))', my_string)
# also tried         
re.findall("(\s+\d{4}\s+)(?:(?!^\d+\.\d+)[\s\S])*", my_string)
# which gets me a little closer but I'm still not getting what I need.

Thanks in advance!

SINGLE LINE STRING APPROACH:

Just match the float number right before the 4 standalone digits:

r'\d+\.\d+\s+(\d{4})\b'

See this regex demo

Python demo:

import re
p = re.compile(r'\d+\.\d+\s+(\d{4})\b')
s = "BEAVER COUNTY 001 0000 1010 BEAVER 2010 BEAVER COUNTY SCH DIST 0.008504 0.008508 4010 COUNTY SPECIAL SERVICE DIST NO.1 4040 BEAVER COUNTY 8005 GREENVILLE SOLAR 0.004258 0.008348 0.008238 4060 SPECIAL SERVICE DISTRICT NO 7"
print(p.findall(s))
# => ['4010', '4060']
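
To make it clearer why the other four-digit groups are skipped, here is a minimal sketch on a shortened sample (the sample strings are made up for illustration, not the asker's full data). A group such as 4040 is only preceded by words, so the required float prefix never matches, and a longer digit run fails on the word boundary.

```python
import re

# Same pattern as in the answer above.
pattern = re.compile(r'\d+\.\d+\s+(\d{4})\b')

# Hypothetical shortened sample, not the asker's full data.
sample = ("2010 BEAVER COUNTY SCH DIST 0.008504 0.008508 4010 COUNTY "
          "4040 BEAVER COUNTY 0.008238 4060 SPECIAL")
codes = pattern.findall(sample)
print(codes)  # => ['4010', '4060']

# A five-digit run after a decimal is rejected: the word boundary
# cannot match between two digits.
too_long = pattern.findall("0.5 12345 X")
print(too_long)  # => []
```

Note that this pattern does not check what follows the four digits; if the trailing alphanumeric word matters for your data, you would need to add that condition yourself.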

ORIGINAL ANSWER: MULTILINE STRING

You may use a regex that checks for a float value on the previous line and then captures the standalone 4 digits at the start of the next line:

re.compile(r'^\d+\.\d+ *[\r\n]+(\d{4})\b', re.M)

See regex demo here

Pattern explanation:

  • ^ - start of a line (as re.M is used)
  • \d+\.\d+ - 1 or more digits, a literal . and again 1 or more digits
  •  * (a space followed by *) - zero or more spaces (replace the space with [^\S\r\n] to match any horizontal whitespace)
  • [\r\n]+ - 1 or more LF or CR symbols (to allow exactly 1 line break, replace with (?:\r?\n|\r))
  • (\d{4})\b - Group 1, which re.findall returns, matching 4 digits followed by a word boundary (the next character is not a digit, letter, or _, or the string ends).

Python demo:

import re
p = re.compile(r'^\d+\.\d+ *[\r\n]+(\d{4})\b', re.MULTILINE)
s = "BEAVER COUNTY 001 0000 \n1010 BEAVER \n2010 BEAVER COUNTY SCH DIST \n0.008504 \n...(more decimals)\n0.008508 \n4010 COUNTY SPECIAL SERVICE DIST NO.1   <---capture this 4010\n4040 BEAVER COUNTY \n8005 GREENVILLE SOLAR\n0.004258 \n0.008348 \n...(more decimals)\n0.008238 \n4060 SPECIAL SERVICE DISTRICT NO 7   <---capture this 4060"
print(p.findall(s)) # => ['4010', '4060']
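
The two substitutions mentioned in the pattern explanation can be compared side by side. This is only a sketch: the test text below is invented to exercise the edge cases (a tab before a CRLF line break, and a blank line between the float and the code), and the names loose and strict are just labels for this example.

```python
import re

# Pattern from the answer: spaces only, any run of CR/LF characters.
loose = re.compile(r'^\d+\.\d+ *[\r\n]+(\d{4})\b', re.MULTILINE)

# Variant built from the suggested substitutions: any horizontal
# whitespace, then exactly one line break (CRLF, LF, or CR).
strict = re.compile(r'^\d+\.\d+[^\S\r\n]*(?:\r?\n|\r)(\d{4})\b', re.MULTILINE)

# Made-up text: a tab before a CRLF break, then a blank line before 4060.
text = "0.008508 \t\r\n4010 COUNTY SPECIAL\n0.008238\n\n4060 SPECIAL SERVICE\n"

loose_hits = loose.findall(text)
strict_hits = strict.findall(text)
print(loose_hits)   # => ['4060']  (the tab blocks the first match)
print(strict_hits)  # => ['4010']  (the blank line blocks the second match)
```

Which variant is right depends on how clean the input is: the strict one tolerates tabs but insists the code sits on the very next line.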

This will help you:

"((\d+\.\d+)\s+)+(\d+)\s?(?=\w+)"gm

Use group three, i.e. \3.

Demo and explanation
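
The pattern above is written in regex-tester notation with trailing gm flags, which Python does not accept as-is. A rough Python equivalent, assuming re.finditer stands in for the g flag (the m flag changes nothing here, since the pattern has no ^ or $), could look like the following. The sample string is shortened and hypothetical.

```python
import re

# The pattern from this answer without the tester-style "gm" suffix.
pattern = re.compile(r'((\d+\.\d+)\s+)+(\d+)\s?(?=\w+)')

# Hypothetical shortened sample.
sample = "DIST 0.008504 0.008508 4010 COUNTY NO.1 4040 BEAVER 0.008238 4060 SPECIAL"

# findall would return one tuple per match (all three groups),
# so pick group 3 explicitly from each match object instead.
codes = [m.group(3) for m in pattern.finditer(sample)]
print(codes)  # => ['4010', '4060']
```

Keep in mind that (\d+) here accepts any number of digits, so unlike the \d{4} patterns it would also capture a three- or five-digit run that follows a decimal.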

Try this pattern:

re.compile(r'(\d+[.]\d+)+\s+(?P<cap>\d{4})\s+\w+')

I wrote a short script to check it, and it works:

import re

p=re.compile(r'(\d+[.]\d+)+\s+(?P<cap>\d{4})\s+\w+')

my_string = """BEAVER COUNTY 001 0000 
1010 BEAVER 
2010 BEAVER COUNTY SCH DIST 
0.008504 
...(more decimals)
0.008508 
4010 COUNTY SPECIAL SERVICE DIST NO.1   <---capture this 4010
4040 BEAVER COUNTY 
8005 GREENVILLE SOLAR
0.004258 
0.008348 
...(more decimals)
0.008238 
4060 SPECIAL SERVICE DISTRICT NO 7   <---capture this 4060
"""

s=my_string.replace("\n", " ")

match=p.finditer(s)

for m in match:
    print(m.group('cap'))
