regex to capture overlapping matches preceding any number with more than 4 digits

I am writing a regular expression to pick out the 30 characters that precede each number with more than 4 digits in the text below. Here is my code:

text = "I went and I bought few tickets and ticket numbers 100000,100001 and 100002.I bought them for 200,300 and 400 USD. Box office collections were 55555555 USD"

import re
reg = r".{0,30}(?:\d+[ .]?){5,}"
regc = re.compile(reg)
res = regc.findall(text)

This gives only partial results.

I am getting 30 characters before 100000 only.

How do I get 30 characters before 100001 and how do I also get 30 characters before 100002?

Match any 30 characters (the dot matches anything except line breaks) and follow them with a positive lookahead, (?=...), which checks that the number comes next without including it in the match:

/.{30}(?=100001)/g

https://regexr.com/4293v
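
The pattern above hardcodes 100001 and uses JavaScript-style /.../g delimiters. A minimal Python sketch of the same lookahead idea, generalized to any run of five or more digits, might look like the following. Wrapping the whole pattern in a lookahead makes every match zero-width, so successive matches are free to overlap. The (?<!\d) guard and the variable names are assumptions added for illustration, not part of the answer. Because .{30} demands a full 30 characters, a number that sits fewer than 30 characters from the start of the text would be skipped.

```python
import re

text = ("I went and I bought few tickets and ticket numbers 100000,100001 and 100002."
        "I bought them for 200,300 and 400 USD. Box office collections were 55555555 USD")

# The outer lookahead consumes nothing, so the scan resumes one character later each time.
# Group 1: exactly 30 characters. Group 2: a run of 5+ digits starting right after them.
# (?<!\d) stops the 30-character window from ending in the middle of a number.
pattern = re.compile(r'(?=(.{30})(?<!\d)(\d{5,}))')

pairs = pattern.findall(text)            # list of (prefix, number) tuples
prefixes = [before for before, number in pairs]
print(prefixes)
# => ['ew tickets and ticket numbers ', 'ets and ticket numbers 100000,',
#     'ket numbers 100000,100001 and ', '. Box office collections were ']
```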

Since you need overlapping matches, you need to use lookarounds. However, lookbehinds in Python's re module must be of fixed width, so you can use a workaround: reverse the string, use a regex with a lookahead, and then reverse the matches:

import re
rev_rx = r'((?:\d+[ .]?){5,})(?=(.{0,30}))'
text="I went and I bought few tickets and ticket numbers 100000,100001 and 100002.I bought them for 200,300 and 400 USD. Box office collections were 55555555 USD"
results = [ "{}{}".format(y[::-1], x[::-1]) for x, y in re.findall(rev_rx, text[::-1]) ]
print(results)
# => ['D. Box office collections were 55555555', 'cket numbers 100000,100001 and 100002', 'ets and ticket numbers 100000,100001', 'few tickets and ticket numbers 100000']


The ((?:\d+[ .]?){5,})(?=(.{0,30})) regex matches and captures into Group 1 five or more sequences of one or more digits, each optionally followed by a space or a dot. The positive lookahead then captures into Group 2 the next 0 to 30 characters without consuming them, which is what lets the contexts overlap. Because the string was reversed, those characters are the ones that precede the number in the original text. All that remains is to reverse Group 2 and Group 1 and concatenate them, in that order, to get the matches you need.
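
To make the two groups concrete, here is a small sketch that prints the intermediate values for a shortened input. The sample string and variable names are chosen only for illustration. One caveat worth knowing: since each digit may count as its own sequence and separators are optional, this pattern also accepts digit groups split by a space or dot, so something like 12 345 would qualify as a five-digit number.

```python
import re

rev_rx = r'((?:\d+[ .]?){5,})(?=(.{0,30}))'
sample = "ticket numbers 100000,100001"   # shortened input, for illustration only

reversed_sample = sample[::-1]
print(reversed_sample)
# => 100001,000001 srebmun tekcit

# Each tuple is (Group 1: reversed number, Group 2: reversed preceding context)
pairs = re.findall(rev_rx, reversed_sample)
print(pairs)
# => [('100001', ',000001 srebmun tekcit'), ('000001 ', 'srebmun tekcit')]

# Undo the reversal: context first, then the number
restored = [after[::-1] + number[::-1] for number, after in pairs]
print(restored)
# => ['ticket numbers 100000,100001', 'ticket numbers 100000']
```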

You can do this by combining some simple regex with string methods to get the 30 characters that precede any number with more than 4 digits (rather than using more complex regex to both find the matches and capture the desired characters).

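The example below uses regex to find all the numbers with more than 4 digits, then uses str.find() to get the position of each match in the original text so you can slice the preceding 30 characters. Note that str.find() returns the position of the first occurrence, so this assumes each number appears only once in the text: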

import re

text = "I went and I bought few tickets and ticket numbers 100000,100001 and 100002.I bought them for 200,300 and 400 USD. Box office collections were 55555555 USD"

patt = re.compile(r'\d{5,}')
nums = patt.findall(text)
matches = [text[:text.find(n)][-30:] for n in nums]

print(matches)
# OUTPUT (shown on multiple lines for readability)
# [
#     'ew tickets and ticket numbers ',
#     'ets and ticket numbers 100000,',
#     'ket numbers 100000,100001 and ',
#     '. Box office collections were '
# ]
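
If the same number can appear more than once, str.find() will report the first occurrence every time. A possible variation, sketched here with a made-up input string, takes the position from each match object via finditer() and m.start() instead. The max(0, ...) clamp is an added assumption so that numbers near the start of the text still yield whatever shorter context exists.

```python
import re

# Hypothetical input in which the same number appears twice
text = "Order 55555 shipped, then order 55555 was returned"

patt = re.compile(r'\d{5,}')

# str.find() locates the first occurrence both times
with_find = [text[:text.find(n)][-30:] for n in patt.findall(text)]

# m.start() is the position of each individual match
with_start = [text[max(0, m.start() - 30):m.start()] for m in patt.finditer(text)]

print(with_find)
# => ['Order ', 'Order ']
print(with_start)
# => ['Order ', 'der 55555 shipped, then order ']
```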
