python regex gives empty string

First off, I am new to regex, but so far I am in love with it. I am using a regex to extract info from the names of image files that I get from a render engine. So far this regex is working decently:

_([a-z]{2,8})_?(\d{1,2})?(\.|_)(\d{3,10})\.([a-z]{2,6})$

If I use the split() method on a file name such as...

image_file_name_ao.0001.exr

I get back a nice little list I can use:

['image_file_name', 'ao', None, '.', '0001', 'exr', '']

My only concern is that it always returns an empty string last. No matter how I change or manipulate the regex it always gives me an empty string at the end of the list. I am totally comfortable with ignoring it and moving on, but my question is am I doing something wrong with my regex or is there something I can do to make it not pass that final empty string? Thank you for your time.

No wonder. The split method splits your string at occurrences of the regex (and also returns the captured groups). Since your regex only matches substrings that reach the end of the line (indicated by the $ at its end), there is nothing left to split off after the match but an empty suffix ('').
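To see this behaviour in isolation, here is a minimal sketch in Python 3 syntax. The shortened pattern and the name 'render.exr' are made up for illustration and are not from the question. re.split() returns the text before the match, then each captured group, then the text after the match, and that last piece is empty whenever the match runs to the end of the string.

```python
import re

# Hypothetical, simplified pattern: one captured group, anchored at the end.
anchored = re.compile(r'\.([a-z]+)$')

# Text before the match, the captured group, then the (empty) text after it.
parts = anchored.split('render.exr')
print(parts)  # ['render', 'exr', '']

# Without a capture group only the surrounding pieces are returned.
no_group = re.split(r'\.[a-z]+$', 'render.exr')
print(no_group)  # ['render', '']
```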

Given that you are already using groups "(...)" in your expression, you could just as well use re.match(regex, string). This gives you a match object, from which you can retrieve a tuple containing your groups via groups():

import re

filename = 'image_file_name_ao.0001.exr'
# additional group up front
reg = r'(\S*)_([a-z]{2,8})_?(\d{1,2})?(\.|_)(\d{3,10})\.([a-z]{2,6})$'
print(re.match(reg, filename).groups())  # request tuple of group matches

Edit: I'm really sorry, but I didn't realize that your pattern does not match the file name string from its first character on, so I extended it in my answer. If you want to stick with your approach using split(), you could also change your original pattern so that the last part of the file name is not matched and is therefore left over as the final piece of the split.
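Another option that leaves the original pattern untouched, offered only as a sketch and not something either poster ran: use re.search() and take the prefix by slicing up to the start of the match. The helper name split_render_name is hypothetical.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(r'_([a-z]{2,8})_?(\d{1,2})?(\.|_)(\d{3,10})\.([a-z]{2,6})$')

def split_render_name(filename):
    """Return [prefix, group1, ..., group5], or None if the name does not match."""
    m = PATTERN.search(filename)
    if m is None:
        return None
    # Everything before the match is the prefix; no trailing '' is produced.
    return [filename[:m.start()]] + list(m.groups())

print(split_render_name('image_file_name_ao.0001.exr'))
# ['image_file_name', 'ao', None, '.', '0001', 'exr']
```

Unlike split(), this returns None for a name that does not match, so a malformed file name is easy to detect.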

Interesting question.

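I changed the regex pattern a little. The extension is now checked with a lookahead, so it is not consumed by the match and stays behind as the last element of the split: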

import re

# Raw strings keep the backslashes intact; adjacent literals are concatenated.
reg = re.compile(r'_([a-z]{2,8})'
                 r'_?(\d\d?)?'
                 r'([._])'
                 r'(\d{3,10})'
                 r'\.'
                 r'(?=[a-z]{2,6}$)')  # lookahead: extension is checked, not consumed

for ss in ('image_file_name_ao.0001.exr',
           'image_file_name_45_ao.0001.exr',
           'image_file_name_ao_78.0001.exr',
           'image_file_name_ao78.0001.exr'):
    print('%s\n%r\n' % (ss, reg.split(ss)))

Result:

image_file_name_ao.0001.exr
['image_file_name', 'ao', None, '.', '0001', 'exr']

image_file_name_45_ao.0001.exr
['image_file_name_45', 'ao', None, '.', '0001', 'exr']

image_file_name_ao_78.0001.exr
['image_file_name', 'ao', '78', '.', '0001', 'exr']

image_file_name_ao78.0001.exr
['image_file_name', 'ao', '78', '.', '0001', 'exr']

You can use filter().

Given your example, it would work like this:

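import re

def f(x):
    return x != ''

# The call must start on the same line as the opening parenthesis.
# list() is needed on Python 3, where filter() returns an iterator.
list(filter(
    f,
    re.split(r'_([a-z]{2,8})_?(\d{1,2})?(\.|_)(\d{3,10})\.([a-z]{2,6})$',
             'image_file_name_ao.0001.exr')
))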
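A list comprehension does the same job and may read more easily. The sketch below (Python 3 syntax, variable names chosen for illustration) also shows one thing to watch for: comparing against '' keeps the None placeholder of an optional group that did not match, whereas filter(None, ...) drops every falsy item and so shifts the positions of the remaining fields.

```python
import re

pattern = r'_([a-z]{2,8})_?(\d{1,2})?(\.|_)(\d{3,10})\.([a-z]{2,6})$'
parts = re.split(pattern, 'image_file_name_ao.0001.exr')

# Drop only empty strings; None (an unmatched optional group) is kept.
cleaned = [p for p in parts if p != '']
print(cleaned)  # ['image_file_name', 'ao', None, '.', '0001', 'exr']

# filter(None, ...) removes every falsy item, including the None placeholder.
compact = list(filter(None, parts))
print(compact)  # ['image_file_name', 'ao', '.', '0001', 'exr']
```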
