简体   繁体   中英

Python/Regex - Match .#,#. in String

What regex can I use to match ".#,#." within a string. It may or may not exist in the string. Some examples with expected outputs might be:

Test1.0,0.csv      -> ('Test1', '0,0', 'csv')         (Basic Example)
Test2.wma          -> ('Test2', 'wma')                (No Match)
Test3.1100,456.jpg -> ('Test3', '1100,456', 'jpg')    (Basic with Large Number)
T.E.S.T.4.5,6.png  -> ('T.E.S.T.4', '5,6', 'png')     (Doesn't strip all periods)
Test5,7,8.sss      -> ('Test5,7,8', 'sss')            (No Match)
Test6.2,3,4.png    -> ('Test6.2,3,4', 'png')          (No Match, to many commas)
Test7.5,6.7,8.test -> ('Test7', '5,6', '7,8', 'test') (Double Match?)

The last one isn't too important and I would only expect that .#,#. would appear once. Most files I'm processing, I would expect to fall into the first through fourth examples, so I'm most interested in those.

Thanks for the help!

You can use the regex \\.\\d+,\\d+\\. to find all matches for that pattern, but you will need to do a little extra to get the output you expect, especially since you want to treat .5,6.7,8. as two matches.

Here is one potential solution:

def transform(s):
    s = re.sub(r'(\.\d+,\d+)+\.', lambda m: m.group(0).replace('.', '\n'), s)
    return tuple(s.split('\n'))

Examples:

>>> transform('Test1.0,0.csv')
('Test1', '0,0', 'csv')
>>> transform('Test2.wma')
('Test2.wma',)
>>> transform('Test3.1100,456.jpg')
('Test3', '1100,456', 'jpg')
>>> transform('T.E.S.T.4.5,6.png')
('T.E.S.T.4', '5,6', 'png')
>>> transform('Test5,7,8.sss')
('Test5,7,8.sss',)
>>> transform('Test6.2,3,4.png')
('Test6.2,3,4.png',)
>>> transform('Test7.5,6.7,8.test')
('Test7', '5,6', '7,8', 'test')

To also get the file extension split off when there are no matches, you can use the following:

def transform(s):
    s = re.sub(r'(\.\d+,\d+)+\.', lambda m: m.group(0).replace('.', '\n'), s)
    groups = s.split('\n')
    groups[-1:] = groups[-1].rsplit('.', 1)
    return tuple(groups)

This will be the same output as above except that 'Test2.wma' becomes ('Test2', 'wma') , with similar behavior for 'Test5,7,8.sss' and 'Test5,7,8.sss' .

To allow for multiple consecutive matches, use lookahead/lookbehind:

r'(?<=\.)\d+,\d+(?=\.)'

Example:

>>> re.findall(r'(?<=\.)\d+,\d+(?=\.)', 'Test7.5,6.7,8.test')
['5,6', '7,8']

We can also use lookahead to perform the split as you want it:

import re
def split_it(s):
    pieces = re.split(r'\.(?=\d+,\d+\.)', s)
    pieces[-1:] = pieces[-1].rsplit('.', 1) # split off extension
    return pieces

Testing:

>>> print split_it('Test1.0,0.csv')
['Test1', '0,0', 'csv']
>>> print split_it('Test2.wma')
['Test2', 'wma']
>>> print split_it('Test3.1100,456.jpg')
['Test3', '1100,456', 'jpg']
>>> print split_it('T.E.S.T.4.5,6.png')
['T.E.S.T.4', '5,6', 'png']
>>> print split_it('Test5,7,8.sss')
['Test5,7,8', 'sss']
>>> print split_it('Test6.2,3,4.png')
['Test6.2,3,4', 'png']
>>> print split_it('Test7.5,6.7,8.test')
['Test7', '5,6', '7,8', 'test']

Use regex pattern ^([^,]+)\\.(\\d+,\\d+)\\.([^,.]+)$

Check this demo >>

>>> print re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'Test1.0,0.csv')
[('Test1', '0,0', 'csv')]

>>> print re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'Test2.wma')
[]

>>> print re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'Test3.1100,456.jpg')
[('Test3', '1100,456', 'jpg')]

>>> print re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'T.E.S.T.4.5,6.png')
[('T.E.S.T.4', '5,6', 'png')]

>>> print re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'Test5,7,8.sss')
[]

>>> print re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'Test6.2,3,4.png')
[]

>>> print re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'Test7.5,6.7,8.test') 
[]
'/^(.+)\.((\d+,\d+)\.)?(.+)$/'

The third capturing group should contain the pair of numbers. If you have multiple of those pairs, you should get multiple matches. And the third capturing would always contain the pair.

^(.*?)\.(\d+,\d+)\.(.*?)$

This passes your tests, at least in Patterns:

通过模式中的测试

这非常接近,python是否支持命名组?

^.*(?P<group1>\d+(?:,\d+)?)\.(?P<group2>\d+(?:,\d+)?).*\..+$

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM