I have this list of links:
['/directory/index.html',
'/index.html',
'#',
'/index.html',
'/kss_how.html',
'dr_info/swearingenlarry.html',
'dr_info/swearingenlarrylast.html',
'dr_info/kingjohn.html',
'dr_info/kingjohnlast.html',
'dr_info/_coble.jpg',
'dr_info/coblebillielast.html',
'dr_info/netherystephen.jpg',
'dr_info/netherystephenlast.html',
'dr_info/rougeaupaul.jpg',
'dr_info/no_last_statement.html',
'dr_info/no_info_available.html',
'dr_info/no_last_statement.html',
'dr_info/no_last_statement.html']
from which I need to select links like 'dr_info/kingjohn.html' and skip the rest.
So far I have only come up with a very inefficient solution:
p_1 = re.compile('dr.*(?<!last).html')
p_1_links = list(filter(p_1.match, links))
p_2 = re.compile('dr.*(?<!statement).html')
p_2_links = list(filter(p_2.match, p_1_links))
p_3 = re.compile('dr.*(?<!available).html')
valid_links = list(filter(p_3.match, p_2_links))
which makes me shiver, and I hope someone can help me fit it into one line.
Desired output from example would be like this:
['dr_info/swearingenlarry.html',
'dr_info/kingjohn.html']
Only links starting with dr_info and ending with html.
No links with last, no_last_statement or no_info_available.
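You can use a list comprehension with str.startswith and str.endswith (which also accepts a tuple of suffixes):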
exceptions = ('last.html', 'statement.html', 'available.html')
links = [link for link in links if link.endswith('.html') and link.startswith('dr') and not link.endswith(exceptions)]
# => ['dr_info/swearingenlarry.html', 'dr_info/kingjohn.html']
See Python demo
The condition link.endswith('.html') and link.startswith('dr') and not link.endswith(exceptions) filters the links list, keeping all items that start with dr, end with .html and do not end with any value in the exceptions tuple.
For educational purposes, a regex solution can look like
rx = re.compile(r'dr.*(?<!last)(?<!statement)(?<!available)\.html')
links = list(filter(rx.fullmatch, links))
See the Python demo and the regex demo.
You can't put the three exceptions into a single lookbehind separated with | alternation operators because lookbehinds in Python's re module must be fixed-width. The .fullmatch method ensures the whole string matches the regex, so no anchors are required.
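Both points can be checked with a short sketch. The candidate link with a trailing query string is a made-up example, not one of the links from the question.

```python
import re

# Alternatives of different lengths inside one lookbehind are rejected by `re`.
lookbehind_error = None
try:
    re.compile(r'dr.*(?<!last|statement|available)\.html')
except re.error as exc:
    lookbehind_error = exc
    print(f"re.error: {exc}")

# Three separate fixed-width lookbehinds compile fine.
rx = re.compile(r'dr.*(?<!last)(?<!statement)(?<!available)\.html')

# Hypothetical link with extra text after ".html".
candidate = 'dr_info/kingjohn.html?page=2'
print(bool(rx.match(candidate)))      # True: match() only needs a matching prefix
print(bool(rx.fullmatch(candidate)))  # False: fullmatch() must consume the whole string
```

If you prefer match or search over fullmatch, you would need to add a $ anchor to the pattern to get the same behaviour.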
To avoid matching links where the excluded words come right after dr (as addressed in the comments), and assuming you only want to match the full link, you may use the following pattern:
^dr(?!.*(?:last|statement|available)).*\.html$
Demo .
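As a rough sketch, the pattern can be applied in Python like this. The shortened links list and the extra link 'dr_info/lastman.html' are assumptions added for illustration; the extra link shows that this pattern rejects the excluded words anywhere after dr, not only right before .html.

```python
import re

# Shortened version of the list from the question, plus one hypothetical link.
links = ['/index.html',
         'dr_info/swearingenlarry.html',
         'dr_info/swearingenlarrylast.html',
         'dr_info/kingjohn.html',
         'dr_info/netherystephen.jpg',
         'dr_info/no_last_statement.html',
         'dr_info/no_info_available.html',
         'dr_info/lastman.html']  # hypothetical: "last" is not at the end

rx = re.compile(r'^dr(?!.*(?:last|statement|available)).*\.html$')

# The pattern carries its own ^ and $ anchors, so search() is enough here.
valid_links = [link for link in links if rx.search(link)]
print(valid_links)
# => ['dr_info/swearingenlarry.html', 'dr_info/kingjohn.html']
```

Whether rejecting 'dr_info/lastman.html' is desirable depends on your data; the endswith approach would keep it.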
You may use a negative lookahead (instead of a negative lookbehind) so that you can use alternation. Try something like this:
dr(?:.(?!last|statement|available))*\.html
Python example:
import re
links = ['/directory/index.html',
'/index.html',
'#',
'/index.html',
'/kss_how.html',
'dr_info/swearingenlarry.html',
'dr_info/swearingenlarrylast.html',
'dr_info/kingjohn.html',
'dr_info/kingjohnlast.html',
'dr_info/_coble.jpg',
'dr_info/coblebillielast.html',
'dr_info/netherystephen.jpg',
'dr_info/netherystephenlast.html',
'dr_info/rougeaupaul.jpg',
'dr_info/no_last_statement.html',
'dr_info/no_info_available.html',
'dr_info/no_last_statement.html',
'dr_info/no_last_statement.html']
p_1 = re.compile(r'dr(?:.(?!last|statement|available))*\.html')
p_1_links = list(filter(p_1.match, links))
print(p_1_links)
Output:
['dr_info/swearingenlarry.html', 'dr_info/kingjohn.html']
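One caveat worth checking: in the pattern above the lookahead runs only after each character has been consumed, so an excluded word that follows dr immediately is not caught. The sketch below uses a made-up link, 'drlast.html', to show this, together with a rearranged variant (lookahead placed before the dot) that rejects it. The names after_dot and before_dot are only illustrative.

```python
import re

# Lookahead checked AFTER each consumed character (pattern from the answer).
after_dot = re.compile(r'dr(?:.(?!last|statement|available))*\.html')
# Lookahead checked BEFORE each character is consumed (rearranged variant).
before_dot = re.compile(r'dr(?:(?!last|statement|available).)*\.html')

edge_case = 'drlast.html'  # hypothetical link, not in the original list
print(bool(after_dot.match(edge_case)))   # True: "last" right after "dr" slips through
print(bool(before_dot.match(edge_case)))  # False

# The variant still behaves the same on links from the question.
print(bool(before_dot.match('dr_info/kingjohn.html')))      # True
print(bool(before_dot.match('dr_info/kingjohnlast.html')))  # False
```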