I've done lots of searching, including this SO post, which almost worked for me.
I'm working with a huge string, trying to capture the groups of four digits that appear after a series of decimal patterns AND before an alphanumeric word.
There are other four digit number groups that don't qualify since they have words or other number patterns before them.
EDIT: My string is not multiline; it is only shown that way here for visual convenience.
For example:
>> my_string = """BEAVER COUNTY 001 0000
1010 BEAVER
2010 BEAVER COUNTY SCH DIST
0.008504
...(more decimals)
0.008508
4010 COUNTY SPECIAL SERVICE DIST NO.1 <---capture this 4010
4040 BEAVER COUNTY
8005 GREENVILLE SOLAR
0.004258
0.008348
...(more decimals)
0.008238
4060 SPECIAL SERVICE DISTRICT NO 7 <---capture this 4060
"""
The ideal re.findall should return:
['4010','4060']
Here are patterns I've tried that are lacking:
re.findall(r'(?=(\d\.\d{6}\s+)(\s+\d{4}\s))', my_string)
# also tried
re.findall("(\s+\d{4}\s+)(?:(?!^\d+\.\d+)[\s\S])*", my_string)
# which gets me a little closer but I'm still not getting what I need.
Thanks in advance!
Just match the float number right before the 4 standalone digits:
r'\d+\.\d+\s+(\d{4})\b'
See this regex demo
import re
p = re.compile(r'\d+\.\d+\s+(\d{4})\b')
s = "BEAVER COUNTY 001 0000 1010 BEAVER 2010 BEAVER COUNTY SCH DIST 0.008504 0.008508 4010 COUNTY SPECIAL SERVICE DIST NO.1 4040 BEAVER COUNTY 8005 GREENVILLE SOLAR 0.004258 0.008348 0.008238 4060 SPECIAL SERVICE DISTRICT NO 7"
print(p.findall(s))
# => ['4010', '4060']
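To see why anchoring on the preceding float matters, here is a small comparison sketch. The shortened sample string and the names naive and anchored are made up for illustration; only the anchored pattern comes from the answer above.

```python
import re

# Shortened, hypothetical sample in the same shape as the question's data.
sample = "1010 BEAVER 0.008504 0.008508 4010 COUNTY 4040 BEAVER 0.008238 4060 SPECIAL"

# Naive attempt: every standalone 4-digit group, wanted or not.
naive = re.findall(r'\b\d{4}\b', sample)

# Pattern from the answer: a float, whitespace, then the captured 4 digits.
anchored = re.findall(r'\d+\.\d+\s+(\d{4})\b', sample)

print(naive)     # ['1010', '4010', '4040', '4060']
print(anchored)  # ['4010', '4060']
```

Since \s also matches line breaks, the same anchored pattern should behave the same way if the values are separated by newlines instead of spaces.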
For multiline input, you may use a regex that checks for a float value on the previous line and then captures the standalone 4 digits at the start of the next line:
re.compile(r'^\d+\.\d+ *[\r\n]+(\d{4})\b', re.M)
See regex demo here
Pattern explanation:
^ - start of a line (as re.M is used)
\d+\.\d+ - 1+ digits, a literal . and again 1 or more digits
' *' (a space followed by *) - zero or more spaces (replace with [^\S\r\n]* to match any horizontal whitespace)
[\r\n]+ - 1 or more LF or CR symbols (to restrict this to exactly 1 linebreak, replace with (?:\r?\n|\r))
(\d{4})\b - Group 1, returned by re.findall, matching 4 digits followed by a word boundary (the next character is not a digit, letter, or _, or the string ends)
import re
p = re.compile(r'^\d+\.\d+ *[\r\n]+(\d{4})\b', re.MULTILINE)
s = "BEAVER COUNTY 001 0000 \n1010 BEAVER \n2010 BEAVER COUNTY SCH DIST \n0.008504 \n...(more decimals)\n0.008508 \n4010 COUNTY SPECIAL SERVICE DIST NO.1 <---capture this 4010\n4040 BEAVER COUNTY \n8005 GREENVILLE SOLAR\n0.004258 \n0.008348 \n...(more decimals)\n0.008238 \n4060 SPECIAL SERVICE DISTRICT NO 7 <---capture this 4060"
print(p.findall(s)) # => ['4010', '4060']
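The two substitutions mentioned in the pattern explanation can be tried side by side. This is only a sketch: the test string (with a tab, a Windows line ending, and a blank line) and the names loose and strict are assumptions chosen to make the difference visible.

```python
import re

# Pattern from the answer: spaces only, then any run of CR/LF characters.
loose = re.compile(r'^\d+\.\d+ *[\r\n]+(\d{4})\b', re.M)

# Variant with both substitutions: any horizontal whitespace,
# then exactly one line break (\r\n, \n, or \r).
strict = re.compile(r'^\d+\.\d+[^\S\r\n]*(?:\r?\n|\r)(\d{4})\b', re.M)

# Hypothetical input: the first float is followed by space + tab + CRLF,
# the second float is followed by a blank line before the 4 digits.
text = "0.008508 \t\r\n4010 COUNTY\n0.5\n\n4060 X"

print(loose.findall(text))   # ['4060']  (tab blocks the first, blank line is allowed)
print(strict.findall(text))  # ['4010']  (tab is allowed, blank line is rejected)
```

Which variant is appropriate depends on whether blank lines between the last decimal and the four-digit code should count as "directly before".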
This will help you:
"((\d+\.\d+)\s+)+(\d+)\s?(?=\w+)"gm
The number you want is in capture group three (\3).
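In Python, the g flag has no direct equivalent (re.finditer and re.findall already scan the whole string), and group three has to be picked out explicitly because re.findall would return a tuple per match when a pattern has several groups. A minimal sketch, assuming the single-line string from the first answer and a made-up variable name codes:

```python
import re

# Same pattern as above, written as a Python raw string without the "gm" flags.
p = re.compile(r'((\d+\.\d+)\s+)+(\d+)\s?(?=\w+)')

s = ("BEAVER COUNTY 001 0000 1010 BEAVER 2010 BEAVER COUNTY SCH DIST "
     "0.008504 0.008508 4010 COUNTY SPECIAL SERVICE DIST NO.1 4040 BEAVER COUNTY "
     "8005 GREENVILLE SOLAR 0.004258 0.008348 0.008238 4060 SPECIAL SERVICE DISTRICT NO 7")

# Keep only group 3, the digits that follow the run of decimals.
codes = [m.group(3) for m in p.finditer(s)]
print(codes)  # ['4010', '4060']
```

Note that (\d+) is not limited to four digits here; replacing it with (\d{4}) would be one way to tighten the match.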
Try this pattern:
re.compile(r'(\d+[.]\d+)+\s+(?P<cap>\d{4})\s+\w+')
I wrote a little code to check it, and it works.
import re
p=re.compile(r'(\d+[.]\d+)+\s+(?P<cap>\d{4})\s+\w+')
my_string = """BEAVER COUNTY 001 0000
1010 BEAVER
2010 BEAVER COUNTY SCH DIST
0.008504
...(more decimals)
0.008508
4010 COUNTY SPECIAL SERVICE DIST NO.1 <---capture this 4010
4040 BEAVER COUNTY
8005 GREENVILLE SOLAR
0.004258
0.008348
...(more decimals)
0.008238
4060 SPECIAL SERVICE DISTRICT NO 7 <---capture this 4060
"""
s=my_string.replace("\n", " ")
# Iterate over all matches and print the named group (Python 3 print syntax).
for m in p.finditer(s):
    print(m.group('cap'))
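The newline replacement in this answer looks optional, because \s+ already matches line breaks. The sketch below checks that on a shortened, hypothetical sample; the names sample, raw and flattened are made up for illustration.

```python
import re

p = re.compile(r'(\d+[.]\d+)+\s+(?P<cap>\d{4})\s+\w+')

# Shortened, hypothetical multiline sample in the shape of the question's data.
sample = ("2010 BEAVER COUNTY SCH DIST\n0.008504\n0.008508\n"
          "4010 COUNTY SPECIAL\n4040 BEAVER COUNTY\n0.008238\n4060 SPECIAL SERVICE\n")

# Search the multiline text directly...
raw = [m.group('cap') for m in p.finditer(sample)]

# ...and after flattening newlines to spaces, as the answer does.
flattened = [m.group('cap') for m in p.finditer(sample.replace("\n", " "))]

print(raw)        # ['4010', '4060']
print(flattened)  # ['4010', '4060']
```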