How to match everything before the first occurrence of a 3 to 18 digit number using Regex?

I am trying to write Python code that matches a pattern in each line of a text file and saves the matched parts in lists.

Below are three example lines from the text file:

FY20 Jan 8 Special Buy Event    592586642 - Dummy text Dummy text Dummy text Dummy text Dummy text - 592586642, Dummy text Dummy text

FY20 Last Minute Gifts (Next Day/PUT)   "364706825 - dummy text dummy text dummy text dummy text dummy text dummy text dummy text - 364706825 dummy text

FY20 Early Access   484015830 dummy text dummy text dummy text dummy text dummy text dummy text - 484015830 dummy text

Below is the regex that I used:

import re

with open('test.txt', encoding="utf8") as f:
    promo = []
    item = []
    for line in f:
        #yo = re.findall(r'(FY20[\s\w]+)\t([0-9]+)', line)
        yo = re.findall(r'(FY20[^\d+]*)+([0-9]*)', line)
        try:
            promo.append(yo[0][0])
            item.append(yo[0][1])
        except IndexError:  # no match on this line
            continue

The above code matches everything before the first occurrence of a digit. It works fine for the last two lines and saves the proper results (promo type and item number) in the lists. However, for the first line, the match stops at the number "8" and the item entry comes out as an empty string:

item = ['', '364706825','484015830']
promo = ['FY20 Jan\t', 'FY20 Jan 8 Special Buy Event\t','FY20 Last Minute Gifts (Next Day/PUT)\t', 'FY20 Early Access\t']

I want the regex to match everything before the first run of digits of a certain length (here, 3 to 18 digits). The expected result is:

item = ['592586642', '364706825','484015830']
promo = ['FY20 Jan 8\t', 'FY20 Jan 8 Special Buy Event\t','FY20 Last Minute Gifts (Next Day/PUT)\t', 'FY20 Early Access\t']

Do not worry about cleaning the results; I just need the proper results for now.

I have tried (FY20[^\d+]*)+([0-9]*), (FY20[^\\d{3,18}]*)+([0-9]*) and many other patterns, but none of them handled every line. Do I have to use conditional if-else statements to match this pattern?

You can practice regex patterns with your examples on debuggex.com. Try this regular expression: (?P<promo>.*?)(?P<item>\d{3,18}).*

It uses named groups, whose values you can read with groupdict().

Code:

import re

with open('test.txt', encoding="utf8") as f:
    text = f.read()
promo = []
item = []
# Lazy .*? stops at the first run of 3 to 18 digits on the line
p = re.compile(r'(?P<promo>.*?)(?P<item>\d{3,18}).*')
for t in text.split('\n'):
    res = p.search(t)
    if res is not None:
        res_dict = res.groupdict()
        promo.append(res_dict['promo'])
        item.append(res_dict['item'])
print(promo)
print(item)

Use \d{2}\d+ (equivalent to \d{3,}) for 3 or more digits, or \d{3,18} for 3 to 18 digits, and read about the Python re module. Using groups() or groupdict() is not mandatory, but named groups make a long regex easier to maintain.
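
As a rough illustration of the quantifier advice above, the sketch below runs the patterns against a shortened copy of the first sample line from the question (assuming a tab before the item number). The variable names are made up for the example. It also shows why the question's [^\d{3,18}] attempt stops early: inside square brackets a quantifier is not a quantifier, so `{`, `,` and `}` are treated as literal characters to exclude.

```python
import re

line = "FY20 Jan 8 Special Buy Event\t592586642 - Dummy text"

# \d{3,18} needs at least three consecutive digits, so the "20" in "FY20"
# and the lone "8" are skipped.
first_number = re.search(r"\d{3,18}", line).group()

# \d{2}\d+ and \d{3,} both mean "three or more digits".
same_match = (
    re.search(r"\d{2}\d+", line).group() == re.search(r"\d{3,}", line).group()
)

# [^\d{3,18}] is a negated set: it excludes digits plus the literal
# characters "{", "," and "}". It still stops at the first single digit.
stops_at = re.match(r"FY20[^\d{3,18}]*", line).group()

print(first_number)
print(same_match)
print(repr(stops_at))
```

The length requirement therefore has to sit on the digit token itself, outside any character class.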

Use this regex:

Regex:

FY20(.*?)(\d{3,18})

Python Sample:

import re


text = '''
FY20 Jan 8 Special Buy Event 592586642 - Dummy text Dummy text Dummy text Dummy text Dummy text - 592586642, Dummy text Dummy text

FY20 Last Minute Gifts (Next Day/PUT) "364706825 - dummy text dummy text dummy text dummy text dummy text dummy text dummy text - 364706825 dummy text

FY20 Early Access 484015830 dummy text dummy text dummy text dummy text dummy text dummy text - 484015830 dummy text
'''

res = re.findall(r'FY20(.*?)(\d{3,18})',text)
print(res)

Output:

[(' Jan 8 Special Buy Event ', '592586642'), (' Last Minute Gifts (Next Day/PUT) "', '364706825'), (' Early Access ', '484015830')]

PS: To include FY20 in the captured text, use this regex: (FY20.*?)\d{3,18}
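
If both the promo text (with its FY20 prefix) and the item number are wanted, one option is to keep two capture groups. This is a minimal sketch, assuming shortened versions of the question's sample lines with a tab before the number; `sample`, `pairs`, `promo` and `item` are illustrative names.

```python
import re

sample = (
    "FY20 Jan 8 Special Buy Event\t592586642 - Dummy text - 592586642, Dummy text\n"
    'FY20 Last Minute Gifts (Next Day/PUT)\t"364706825 - dummy text - 364706825 dummy text\n'
    "FY20 Early Access\t484015830 dummy text - 484015830 dummy text\n"
)

# ^ with re.MULTILINE anchors each match at the start of a line, so the
# repeated number later in the same line cannot start a second match.
pairs = re.findall(r"^(FY20.*?)(\d{3,18})", sample, flags=re.MULTILINE)

promo = [p for p, _ in pairs]
item = [i for _, i in pairs]
print(promo)
print(item)
```

With two groups, re.findall returns a list of (promo, item) tuples, which the two list comprehensions split into the separate lists the question asks for.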

This works for me:

>>> import re
>>> text = '''
... FY20 Jan 8 Special Buy Event 592586642 - Dummy text Dummy text Dummy text Dummy text Dummy text - 592586642, Dummy text Dummy text
... FY20 Last Minute Gifts (Next Day/PUT) "364706825 - dummy text dummy text dummy text dummy text dummy text dummy text dummy text - 364706825 dummy text
... FY20 Early Access 484015830 dummy text dummy text dummy text dummy text dummy text dummy text - 484015830 dummy text
... '''
>>> text = [t for t in text.split('\n') if len(t) > 10]
>>> text
['FY20 Jan 8 Special Buy Event 592586642 - Dummy text Dummy text Dummy text Dummy text Dummy text - 592586642, Dummy text Dummy text', 'FY20 Last Minute Gifts (Next Day/PUT) "364706825 - dummy text dummy text dummy text dummy text dummy text dummy text dummy text - 364706825 dummy text', 'FY20 Early Access 484015830 dummy text dummy text dummy text dummy text dummy text dummy text - 484015830 dummy text']
>>> for t in text :
...     re.findall( r'\d{3,18}', t )
... 
['592586642', '592586642']
['364706825', '364706825']
['484015830', '484015830']
>>> for t in text :
...     pattern = re.findall( r'\d{3,18}', t )
...     print(t[:t.find(pattern[0])])
... 
FY20 Jan 8 Special Buy Event 
FY20 Last Minute Gifts (Next Day/PUT) "
FY20 Early Access 
>>>

I use re to find the number you need, then simple string slicing to take everything before its first occurrence and print the result.
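
The same idea can be sketched with re.search, whose match object already knows where the number starts, so a second lookup with str.find is not needed. The helper name `split_at_first_number` is hypothetical and not part of the answer above, and the sample lines are shortened.

```python
import re

NUMBER = re.compile(r"\d{3,18}")

def split_at_first_number(line):
    """Return (text before the first 3-18 digit run, the digits), or None."""
    m = NUMBER.search(line)
    if m is None:
        return None
    return line[:m.start()], m.group()

lines = [
    "FY20 Jan 8 Special Buy Event 592586642 - Dummy text - 592586642, Dummy text",
    "FY20 Early Access 484015830 dummy text - 484015830 dummy text",
    "a line without a long number 12",
]
results = [split_at_first_number(l) for l in lines]
for r in results:
    print(r)
```

Returning None for lines without a qualifying number leaves it to the caller to decide whether to skip such lines or report them.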
