Regex to capture overlapping matches preceding any number with more than 4 digits

I am writing a regular expression to pick the 30 characters that precede each number with more than 4 digits in the text below. Here is my code:

import re

text = "I went and I bought few tickets and ticket numbers 100000,100001 and 100002.I bought them for 200,300 and 400 USD. Box office collections were 55555555 USD"

reg = r".{0,30}(?:[\d]+[ .]?){5,}"
regc = re.compile(reg)
res = regc.findall(text)

This gives only partial results: I get the 30 characters before 100000 only.

How do I also get the 30 characters before 100001 and the 30 characters before 100002?

You are looking for any 30 characters (except line breaks) in front of the number. Put the number in a positive lookahead, (?=...), so that it must follow the 30 characters but is not included in the match:

/.{30}(?=100001)/g

https://regexr.com/4293v
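The pattern above is written as a JavaScript-style literal and hard-codes one number. As a rough sketch of how the same lookahead idea might be carried over to Python's re module and generalized to any number with 5 or more digits, the 30 characters can be captured inside the lookahead itself, so nothing is consumed and matches may overlap. The variable names are illustrative, and the sketch assumes every number is preceded by at least 30 characters on the same line; numbers nearer the start of the text would be skipped.

```python
import re

text = ("I went and I bought few tickets and ticket numbers 100000,100001 and 100002."
        "I bought them for 200,300 and 400 USD. Box office collections were 55555555 USD")

# The whole pattern is a zero-width lookahead, so matches are allowed to overlap.
# Group 1 captures exactly 30 characters; (?<!\d) makes sure the digits that follow
# begin a number rather than continue one; \d{5,} requires 5 or more digits.
pattern = re.compile(r'(?=(.{30})(?<!\d)\d{5,})')

contexts = pattern.findall(text)
print(contexts)
# ['ew tickets and ticket numbers ', 'ets and ticket numbers 100000,',
#  'ket numbers 100000,100001 and ', '. Box office collections were ']
```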

Since you need overlapping matches, you need to use lookarounds. However, lookbehinds in re must be of fixed width, so you can use a hack: reverse the string, use a regex with a lookahead, and then reverse the matches:

import re

# Pattern for the reversed text: Group 1 is the number, Group 2 the (reversed) preceding context
rev_rx = r'((?:\d+[ .]?){5,})(?=(.{0,30}))'
text = "I went and I bought few tickets and ticket numbers 100000,100001 and 100002.I bought them for 200,300 and 400 USD. Box office collections were 55555555 USD"
# Search the reversed text, then reverse each group back and join context + number
results = ["{}{}".format(y[::-1], x[::-1]) for x, y in re.findall(rev_rx, text[::-1])]
print(results)
# => ['D. Box office collections were 55555555', 'cket numbers 100000,100001 and 100002', 'ets and ticket numbers 100000,100001', 'few tickets and ticket numbers 100000']


The ((?:\d+[ .]?){5,})(?=(.{0,30})) regex matches and captures into Group 1 five or more repetitions of one or more digits followed by an optional space or dot. Then the positive lookahead captures the next 0 to 30 characters into Group 2 without consuming them. Because the text is reversed, those are the characters that precede the number in the original text, so all you need to do is concatenate the reversed Group 2 and Group 1 values to get the matches you need.
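To see why a lookaround is needed for overlapping results at all, a much smaller example may help. This is only an illustrative sketch with made-up input: a plain pattern consumes the characters it matches, so the next search starts after them, whereas a capturing group inside a lookahead consumes nothing and can report a match at every position.

```python
import re

digits = "12345"

# A consuming pattern: after matching '123' the search resumes at '4',
# and '45' is too short to match again.
consuming = re.findall(r"\d\d\d", digits)

# A zero-width lookahead with a capturing group: the regex engine advances
# one character at a time, so every overlapping window is reported.
overlapping = re.findall(r"(?=(\d\d\d))", digits)

print(consuming)    # ['123']
print(overlapping)  # ['123', '234', '345']
```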

You can do this by combining a simple regex with string methods to get the 30 characters that precede any number with more than 4 digits (rather than using a more complex regex to both find the matches and capture the desired characters).

The example below uses a regex to find all the numbers with more than 4 digits, then uses str.find() to get the position of each match in the original text so you can slice the preceding 30 characters:

import re

text = "I went and I bought few tickets and ticket numbers 100000,100001 and 100002.I bought them for 200,300 and 400 USD. Box office collections were 55555555 USD"

patt = re.compile(r'\d{5,}')
nums = patt.findall(text)
matches = [text[:text.find(n)][-30:] for n in nums]

print(matches)
# OUTPUT (shown on multiple lines for readability)
# [
#     'ew tickets and ticket numbers ',
#     'ets and ticket numbers 100000,',
#     'ket numbers 100000,100001 and ',
#     '. Box office collections were '
# ]
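One caveat with str.find() is that it returns the position of the first occurrence, so if the same number appeared twice in the text, both entries would get the context of the first one. A possible variant, sketched here with a hypothetical helper name and hypothetical parameters, takes the position from each match object via re.finditer() instead:

```python
import re

def context_before_numbers(text, width=30, min_digits=5):
    """Return up to `width` characters preceding each run of `min_digits` or more digits.

    Hypothetical helper for illustration; the name and parameters are not from the original answer.
    """
    pattern = re.compile(r'\d{%d,}' % min_digits)
    # m.start() is the index of this particular match, so repeated numbers are handled correctly.
    return [text[max(0, m.start() - width):m.start()] for m in pattern.finditer(text)]

sample = "a 12345 b 12345"
print(context_before_numbers(sample))
# ['a ', 'a 12345 b ']
```

With the str.find() approach, the same sample would yield 'a ' for both numbers, because both lookups land on the first 12345.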
