regex to capture overlapping matches preceding any number with more than 4 digits

Question

I am writing a regular expression to capture the 30 characters that precede each number with more than 4 digits in the text below. Here is my code:

import re

text = "I went and I bought few tickets and ticket numbers 100000,100001 and 100002.I bought them for 200,300 and 400 USD. Box office collections were 55555555 USD"

reg = ".{0,30}(?:[\d]+[ .]?){5,}"
regc = re.compile(reg)
res = regc.findall(text)

This gives only partial results: I get the 30 characters before 100000, but nothing for the other numbers.

How do I also get the 30 characters before 100001 and the 30 characters before 100002?

Answer 1

Match any 30 characters (. matches anything except a line break) followed by a positive lookahead, (?=...), which requires the number to come next without including it in the match:

/.{30}(?=100001)/g

https://regexr.com/4293v
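The pattern above hard-codes one number and uses JavaScript-style delimiters. As a rough sketch of how the same lookahead idea might be generalized in Python (the pattern and the variable names below are my own assumptions, not part of the original answer), the whole expression can be placed inside a lookahead with a capturing group, so that no characters are consumed and the captured windows are free to overlap:

```python
import re

text = ("I went and I bought few tickets and ticket numbers 100000,100001 and 100002."
        "I bought them for 200,300 and 400 USD. Box office collections were 55555555 USD")

# Everything sits inside a lookahead, so each match is zero-width and
# later matches can reuse characters already captured by earlier ones.
# (?<!\d) ensures the digit run is entered at its first digit only,
# so a long number such as 55555555 is reported once.
pattern = re.compile(r'(?=(.{30})(?<!\d)\d{5,})')

# With a single capturing group, findall returns the group contents.
contexts = pattern.findall(text)
for context in contexts:
    print(repr(context))
```

Note that this sketch requires exactly 30 characters, so a number that sits fewer than 30 characters from the start of the text is skipped.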

Answer 2

Since you need overlapping matches, you need to use lookarounds. However, lookbehinds in re are of fixed width, so, you may utilize a hack: reverse the string, use a regex with a lookahead, and then reverse the matches:

import re
rev_rx = r'((?:\d+[ .]?){5,})(?=(.{0,30}))'
text="I went and I bought few tickets and ticket numbers 100000,100001 and 100002.I bought them for 200,300 and 400 USD. Box office collections were 55555555 USD"
results = [ "{}{}".format(y[::-1], x[::-1]) for x, y in re.findall(rev_rx, text[::-1]) ]
print(results)
# => ['D. Box office collections were 55555555', 'cket numbers 100000,100001 and 100002', 'ets and ticket numbers 100000,100001', 'few tickets and ticket numbers 100000']

See the Python demo.

The ((?:\d+[ .]?){5,})(?=(.{0,30})) regex matches and captures into Group 1 five or more repetitions of 1+ digits followed by an optional space or dot. The positive lookahead then captures the next 0 to 30 chars into Group 2 without consuming them, which is what lets the results overlap. Because the string is reversed, Group 2 holds the text that precedes the number in the original string. So, all you need to do is concatenate the reversed Group 2 and Group 1 values to get the matches you need.
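To see why the reversal hack is needed at all, here is a small check of the fixed-width restriction on lookbehinds in the standard re module. The helper name compiles is hypothetical and only there for illustration:

```python
import re

def compiles(pattern):
    """Return True if the re module accepts the pattern, False otherwise."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# A lookbehind of exactly 30 characters has a fixed width and is accepted.
fixed_ok = compiles(r'(?<=.{30})\d{5,}')

# A lookbehind of 0 to 30 characters has a variable width and is rejected.
variable_ok = compiles(r'(?<=.{0,30})\d{5,}')

print(fixed_ok, variable_ok)
```

A lookahead has no such restriction, which is why running the pattern over the reversed string works.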

Answer 3

You can do this by combining some simple regex with string methods to get the 30 characters that precede any number with more than 4 digits (rather than using more complex regex to both find the matches and capture the desired characters).

The example below uses regex to find all the numbers with more than 4 digits, then uses str.find() to get the position of each match in the original text so you can slice the preceding 30 characters:

import re

text = "I went and I bought few tickets and ticket numbers 100000,100001 and 100002.I bought them for 200,300 and 400 USD. Box office collections were 55555555 USD"

patt = re.compile(r'\d{5,}')
nums = patt.findall(text)
matches = [text[:text.find(n)][-30:] for n in nums]

print(matches)
# OUTPUT (shown on multiple lines for readability)
# [
#     'ew tickets and ticket numbers ',
#     'ets and ticket numbers 100000,',
#     'ket numbers 100000,100001 and ',
#     '. Box office collections were '
# ]
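One caveat with str.find() is that it always returns the position of the first occurrence, so if the same number appears twice in the text, both results would show the context of the first one. A possible variation, sketched here with a hypothetical helper named context_before, takes the position from each match object instead:

```python
import re

def context_before(text, width=30, min_digits=5):
    """Return up to `width` characters preceding each run of `min_digits`+ digits."""
    pattern = re.compile(r'\d{%d,}' % min_digits)
    # m.start() is the position of this particular match, so repeated
    # numbers each get their own preceding context.
    return [text[max(0, m.start() - width):m.start()]
            for m in pattern.finditer(text)]

text = "I went and I bought few tickets and ticket numbers 100000,100001 and 100002.I bought them for 200,300 and 400 USD. Box office collections were 55555555 USD"

matches = context_before(text)
print(matches)

# The same number twice: each occurrence keeps its own context.
repeated = context_before("a 12345 b 12345")
print(repeated)
```

The max(0, ...) guard keeps the slice from wrapping around when a number sits closer than 30 characters to the start of the text.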

3 answers

solution1
0 2018-10-31 13:34:27

solution2
0 2018-10-31 13:35:16

solution3
0 2018-10-31 13:36:33
