Python regex to match word before extension

I have this list of links:

 ['/directory/index.html',
 '/index.html',
 '#',
 '/index.html',
 '/kss_how.html',
 'dr_info/swearingenlarry.html',
 'dr_info/swearingenlarrylast.html',
 'dr_info/kingjohn.html',
 'dr_info/kingjohnlast.html',
 'dr_info/_coble.jpg',
 'dr_info/coblebillielast.html',
 'dr_info/netherystephen.jpg',
 'dr_info/netherystephenlast.html',
 'dr_info/rougeaupaul.jpg',
 'dr_info/no_last_statement.html',
 'dr_info/no_info_available.html',
 'dr_info/no_last_statement.html',
 'dr_info/no_last_statement.html']

From this list I need to select links like

'dr_info/kingjohn.html'

and skip the rest.

So far I have only come up with a very inefficient solution:

p_1 = re.compile('dr.*(?<!last).html')
p_1_links = list(filter(p_1.match, links))

p_2 = re.compile('dr.*(?<!statement).html')
p_2_links = list(filter(p_2.match, p_1_links))

p_3 = re.compile('dr.*(?<!available).html')
valid_links = list(filter(p_3.match, p_2_links))

This makes me shiver, and I hope someone can help me fit it into one line.

The desired output for the example would be:

['dr_info/swearingenlarry.html',
 'dr_info/kingjohn.html']

Only links starting with dr_info and ending with html. No links with last, no_last_statement or no_info_available.

Use

exceptions = ('last.html', 'statement.html', 'available.html')
links = [link for link in links if link.endswith('.html') and link.startswith('dr') and not link.endswith(exceptions)]
# => ['dr_info/swearingenlarry.html', 'dr_info/kingjohn.html']

The condition link.endswith('.html') and link.startswith('dr') and not link.endswith(exceptions) filters the links list, keeping only the entries that start with dr, end with .html, and do not end with any value in the exceptions tuple.

For educational purposes, a regex solution could look like this:
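As a minimal sketch of the same filter wrapped in a reusable function: the name select_links and its keyword parameters are hypothetical, introduced here only for illustration, and the sample list is a shortened version of the one in the question.

```python
def select_links(links, prefix="dr", suffix=".html",
                 exceptions=("last.html", "statement.html", "available.html")):
    """Keep links that start with prefix, end with suffix,
    and do not end with any of the exception suffixes."""
    return [
        link for link in links
        if link.startswith(prefix)
        and link.endswith(suffix)
        and not link.endswith(exceptions)  # str.endswith accepts a tuple of suffixes
    ]


# Shortened sample taken from the question's list
sample = [
    "/index.html",
    "dr_info/kingjohn.html",
    "dr_info/kingjohnlast.html",
    "dr_info/_coble.jpg",
    "dr_info/no_last_statement.html",
    "dr_info/no_info_available.html",
]

result = select_links(sample)
print(result)  # ['dr_info/kingjohn.html']
```

Because str.endswith accepts a tuple, a new exclusion only needs another entry in exceptions rather than another condition.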

rx = re.compile(r'dr.*(?<!last)(?<!statement)(?<!available)\.html')
links = list(filter(rx.fullmatch, links))

You can't put the three exceptions into a single lookbehind separated with | alternation operators, because the alternatives have different lengths and lookbehinds in Python's re module must be fixed-width. The .fullmatch method ensures that the whole string matches the regex, so no anchors are required.
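A short sketch of both points, assuming the standard library re module: the single-lookbehind variant fails to compile, while the chained version works, and match and fullmatch differ on a string with trailing text. The .bak link is a made-up example, not from the question.

```python
import re

# One lookbehind with alternatives of different lengths is rejected by `re`.
try:
    re.compile(r"dr.*(?<!last|statement|available)\.html")
    single_lookbehind_error = None
except re.error as exc:
    single_lookbehind_error = str(exc)

print(single_lookbehind_error)

# One fixed-width lookbehind per excluded word compiles fine.
rx = re.compile(r"dr.*(?<!last)(?<!statement)(?<!available)\.html")

# Hypothetical link with trailing text after ".html"
trailing = "dr_info/kingjohn.html.bak"

# match() is anchored only at the start, so the trailing ".bak" is ignored.
print(bool(rx.match(trailing)))      # True
# fullmatch() must consume the entire string, so this link is rejected.
print(bool(rx.fullmatch(trailing)))  # False
```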

Update:

To avoid matching links where the excluded words come right after dr (as pointed out in the comments), and assuming you only want to match the full link, you may use the following pattern:

^dr(?!.*(?:last|statement|available)).*\.html$
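To illustrate the difference, here is a small sketch comparing this pattern with the one from the original answer. The test strings drlast.html and dr_info/lastname.html are hypothetical and do not appear in the question's list.

```python
import re

# Pattern from the original answer: the lookahead is checked only after each consumed character.
original = re.compile(r"dr(?:.(?!last|statement|available))*\.html")
# Updated pattern: one lookahead right after "dr" scans the rest of the string.
updated = re.compile(r"^dr(?!.*(?:last|statement|available)).*\.html$")

tricky = "drlast.html"  # excluded word immediately after "dr"

print(bool(original.match(tricky)))  # True: "last" directly after "dr" slips through
print(bool(updated.match(tricky)))   # False

print(bool(updated.match("dr_info/kingjohn.html")))  # True
# The updated pattern rejects the excluded words anywhere after "dr", not only before ".html".
print(bool(updated.match("dr_info/lastname.html")))  # False
```

If a link such as dr_info/lastname.html should be kept, the suffix-based check or the chained lookbehinds are probably a better fit, since they only look at what comes directly before .html.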

Original answer:

You may use a negative lookahead (instead of a negative lookbehind) so that you can use alternation. Try something like this:

dr(?:.(?!last|statement|available))*\.html

Python example:

import re

links = ['/directory/index.html',
 '/index.html',
 '#',
 '/index.html',
 '/kss_how.html',
 'dr_info/swearingenlarry.html',
 'dr_info/swearingenlarrylast.html',
 'dr_info/kingjohn.html',
 'dr_info/kingjohnlast.html',
 'dr_info/_coble.jpg',
 'dr_info/coblebillielast.html',
 'dr_info/netherystephen.jpg',
 'dr_info/netherystephenlast.html',
 'dr_info/rougeaupaul.jpg',
 'dr_info/no_last_statement.html',
 'dr_info/no_info_available.html',
 'dr_info/no_last_statement.html',
 'dr_info/no_last_statement.html']

# Raw string so that \. reaches the regex engine as an escaped dot
p_1 = re.compile(r'dr(?:.(?!last|statement|available))*\.html')
p_1_links = list(filter(p_1.match, links))

print(p_1_links)

Output:

['dr_info/swearingenlarry.html', 'dr_info/kingjohn.html']
