I am trying to write Python code that matches a pattern in text and saves the matches in lists.
Below are 3 example lines from a text file:
FY20 Jan 8 Special Buy Event 592586642 - Dummy text Dummy text Dummy text Dummy text Dummy text - 592586642, Dummy text Dummy text
FY20 Last Minute Gifts (Next Day/PUT) "364706825 - dummy text dummy text dummy text dummy text dummy text dummy text dummy text - 364706825 dummy text
FY20 Early Access 484015830 dummy text dummy text dummy text dummy text dummy text dummy text - 484015830 dummy text
Below is the regex that I used:
import re

with open('test.txt', encoding="utf8") as f:
    promo = []
    item = []
    for line in f:
        #yo = re.findall('(FY20[\s\w]+)\t([0-9]+)', line)
        yo = re.findall('(FY20[^\d+]*)+([0-9]*)', line)
        try:
            promo.append(yo[0][0])
            item.append(yo[0][1])
        except:
            continue
The above code matches everything before the first occurrence of a digit. It works fine for the last 2 lines and saves the proper results (promo type and item number) in the lists. However, for the first line the match stops at the number "8" and an empty string is stored for item:
item = ['', '364706825','484015830']
promo = ['FY20 Jan\t', 'FY20 Jan 8 Special Buy Event\t','FY20 Last Minute Gifts (Next Day/PUT)\t', 'FY20 Early Access\t']
I want the regex to match everything before a run of digits of a certain length occurs. Expected output:
item = ['592586642', '364706825','484015830']
promo = ['FY20 Jan 8\t', 'FY20 Jan 8 Special Buy Event\t','FY20 Last Minute Gifts (Next Day/PUT)\t', 'FY20 Early Access\t']
Do not worry about cleaning the results; I just need the correct matches for now.
I have tried (FY20[^\d+]*)+([0-9]*)
and (FY20[^\\d{3,18}]*)+([0-9]*)
and many others, but none of them handles every line. Do I have to use conditional if-else statements to match this pattern?
You can practice regex patterns against your examples on debuggex.com. Regular expression:
(?P<promo>.*?)(?P<item>\d{3,18}).*
Try named groups, for example with groupdict().
Code:
import re

with open('test.txt', encoding="utf8") as f:
    text = f.read()
promo = []
item = []
# Lazy prefix, then the first run of 3 to 18 digits on the line
p = re.compile(r'(?P<promo>.*?)(?P<item>\d{3,18}).*')
for t in text.split('\n'):
    res = p.search(t)
    if res is not None:
        res_dict = res.groupdict()
        promo.append(res_dict['promo'])
        item.append(res_dict['item'])
print(promo)
print(item)
Use \d{2}\d+ for 3 or more digits, or \d{3,18} for 3 to 18 digits if you want, and read about the Python re module. Using groupdict() is not mandatory (groups() works too), but named groups make a long regex simpler to maintain.
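To see why the character-class attempts from the question stop too early while a quantified \d works, here is a minimal sketch. The sample line is an assumption: it uses a tab before the item number, as the '\t' in the question's output suggests.

```python
import re

# Assumed sample line (tab before the item number, as in the question's output)
line = 'FY20 Jan 8 Special Buy Event\t592586642 - Dummy text'

# Inside [...] characters such as '+', '{', '3' and ',' are literals, not
# quantifiers, so [^\d+] only means "not a digit and not a plus sign".
broken = re.match(r'(FY20[^\d+]*)', line)
print(repr(broken.group(1)))  # the match ends right before the "8"

# A lazy prefix followed by \d{3,18} outside any character class skips the
# single "8" and stops at the first run of 3 to 18 digits.
fixed = re.match(r'(?P<promo>.*?)(?P<item>\d{3,18})', line)
print(fixed.groupdict())
```

The length limit has to be written as a quantifier on \d itself; putting {3,18} inside the brackets just adds those characters to the excluded set.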
Use this regex:
FY20(.*?)(\d{3,18})
Python Sample:
import re
text = '''
FY20 Jan 8 Special Buy Event 592586642 - Dummy text Dummy text Dummy text Dummy text Dummy text - 592586642, Dummy text Dummy text
FY20 Last Minute Gifts (Next Day/PUT) "364706825 - dummy text dummy text dummy text dummy text dummy text dummy text dummy text - 364706825 dummy text
FY20 Early Access 484015830 dummy text dummy text dummy text dummy text dummy text dummy text - 484015830 dummy text
'''
res = re.findall(r'FY20(.*?)(\d{3,18})',text)
print(res)
Output:
[(' Jan 8 Special Buy Event ', '592586642'), (' Last Minute Gifts (Next Day/PUT) "', '364706825'), (' Early Access ', '484015830')]
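PS: To include FY20 in the captured text, use this regex: (FY20.*?)\d{3,18} (keep the parentheses around \d{3,18} as well if you still need the item number).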
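As a rough sketch of how this could fill the two lists from the question, the variant below keeps both groups and takes only the first match on each line. The sample lines are shortened versions of the question's data, and the list names promo and item follow the question.

```python
import re

# Shortened sample lines (an assumption; the real file has longer dummy text)
text = '''FY20 Jan 8 Special Buy Event 592586642 - Dummy text - 592586642, Dummy text
FY20 Last Minute Gifts (Next Day/PUT) "364706825 - dummy text - 364706825 dummy text
FY20 Early Access 484015830 dummy text - 484015830 dummy text
'''

promo, item = [], []
for line in text.splitlines():
    # Group 1: "FY20" plus everything up to the first 3-18 digit run
    # Group 2: that digit run (the item number)
    m = re.search(r'(FY20.*?)(\d{3,18})', line)
    if m:
        promo.append(m.group(1))
        item.append(m.group(2))

print(promo)
print(item)
```

Using re.search per line means the repeated item number later in the same line is ignored, so each line contributes exactly one entry to each list.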
This works for me:
>>> import re
>>> text = '''
... FY20 Jan 8 Special Buy Event 592586642 - Dummy text Dummy text Dummy text Dummy text Dummy text - 592586642, Dummy text Dummy text
... FY20 Last Minute Gifts (Next Day/PUT) "364706825 - dummy text dummy text dummy text dummy text dummy text dummy text dummy text - 364706825 dummy text
... FY20 Early Access 484015830 dummy text dummy text dummy text dummy text dummy text dummy text - 484015830 dummy text
... '''
>>> text = [t for t in text.split('\n') if len(t) > 10]
>>> text
['FY20 Jan 8 Special Buy Event 592586642 - Dummy text Dummy text Dummy text Dummy text Dummy text - 592586642, Dummy text Dummy text', 'FY20 Last Minute Gifts (Next Day/PUT) "364706825 - dummy text dummy text dummy text dummy text dummy text dummy text dummy text - 364706825 dummy text', 'FY20 Early Access 484015830 dummy text dummy text dummy text dummy text dummy text dummy text - 484015830 dummy text']
>>> for t in text:
...     re.findall(r'\d{3,18}', t)
...
['592586642', '592586642']
['364706825', '364706825']
['484015830', '484015830']
>>> for t in text:
...     pattern = re.findall(r'\d{3,18}', t)
...     print(t[:t.find(pattern[0])])
...
FY20 Jan 8 Special Buy Event
FY20 Last Minute Gifts (Next Day/PUT) "
FY20 Early Access
>>>
I use re to find the number you need, then simple string slicing to take everything before that number and print the result.
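The same idea can be sketched with re.search, which returns a match object whose start() gives the slice position directly, so the separate str.find call is not needed. The sample lines are shortened and the results list is a hypothetical name used only for this illustration.

```python
import re

# Shortened sample lines (an assumption, not the full file contents)
lines = [
    'FY20 Jan 8 Special Buy Event 592586642 - Dummy text - 592586642, Dummy text',
    'FY20 Early Access 484015830 dummy text - 484015830 dummy text',
]

results = []  # hypothetical list of (promo, item) pairs
for t in lines:
    m = re.search(r'\d{3,18}', t)  # first run of 3 to 18 digits
    if m is None:
        continue  # no item number on this line
    results.append((t[:m.start()], m.group()))

for promo_text, item_number in results:
    print(repr(promo_text), item_number)
```

One caveat with this approach: the pattern is not anchored on FY20, so a prefix that itself contains three or more consecutive digits (for example a four-digit year) would be picked up as the item number.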