I'm trying to remove several strings from a list of URLs. I have more than 300k URLs, and I'm trying to find which are variations of the original. Here's a toy example that I've been working with.
URLs = ['example.com/page.html',
'www.example.com/in/page.html',
'example.com/ca/fr/page.html',
'm.example.com/de/page.html',
'example.com/fr/page.html']
locs = ['/in', '/ca', '/de', '/fr', 'm.', 'www.']
What I'd like to end up with is a list of the pages without the language or locations:
desired_output = ['example.com/page.html',
'example.com/page.html',
'example.com/page.html',
'example.com/page.html',
'example.com/page.html']
I've tried list comprehensions and nested for loops, but nothing has worked yet. Can anyone help?
import re

# doesn't remove anything
for item in URLs:
    for string in locs:
        re.sub(string, '', item)
# doesn't remove anything
for item in URLs:
    for string in locs:
        item.strip(string)
# only removes the last string in locs
clean = []
for item in URLs:
    for string in locs:
        new = item.replace(string, '')
    clean.append(new)
You have to assign the result of replace to item again:
clean = []
for item in URLs:
    for loc in locs:
        item = item.replace(loc, '')
    clean.append(item)
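To illustrate why the attempts in the question leave the URLs unchanged, here is a minimal sketch. The variable names (original, unchanged, cleaned, trimmed) are made up for this example. It relies on two facts about Python strings: they are immutable, so replace returns a new string, and strip treats its argument as a set of characters to trim from both ends rather than as a substring.

```python
original = 'www.example.com/in/page.html'

# str.replace returns a new string; the original is left untouched.
original.replace('www.', '')
unchanged = original

# Keeping the return value is what makes the change visible.
cleaned = original.replace('www.', '').replace('/in', '')

# str.strip trims any of the given characters from both ends,
# so it is not a substring remover.
trimmed = 'm.example.com'.strip('m.')

print(unchanged)  # www.example.com/in/page.html
print(cleaned)    # example.com/page.html
print(trimmed)    # example.co
```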
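or, more compactly, with reduce (which has to be imported from functools in Python 3):
from functools import reduce

clean = [
    reduce(lambda item, loc: item.replace(loc, ''), locs, item)
    for item in URLs
]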
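If you would rather stay with re, as in the first attempt from the question, one possible sketch is to build a single compiled pattern from locs. Note that the unescaped pattern 'm.' would also match 'mp' in 'example', because '.' matches any character, so re.escape is used here. The names pattern and clean are only illustrative. A single pass is not always equivalent to chained replace calls (removing one fragment can create another), and whether it is faster on 300k URLs is something you would need to time yourself.

```python
import re

urls = ['example.com/page.html',
        'www.example.com/in/page.html',
        'example.com/ca/fr/page.html',
        'm.example.com/de/page.html',
        'example.com/fr/page.html']
locs = ['/in', '/ca', '/de', '/fr', 'm.', 'www.']

# re.escape keeps the dots in 'm.' and 'www.' literal.
pattern = re.compile('|'.join(re.escape(loc) for loc in locs))

# Remove every match in one pass per URL, keeping the returned string.
clean = [pattern.sub('', url) for url in urls]
print(clean)
```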
The biggest problem you have is that you don't save the return value: replace returns a new string rather than modifying the one it is called on.
urls = ['example.com/page.html',
'www.example.com/in/page.html',
'example.com/ca/fr/page.html',
'm.example.com/de/page.html',
'example.com/fr/page.html']
locs = ['/in', '/ca', '/de', '/fr', 'm.', 'www.']
stripped = list(urls)  # create a new copy, not necessary
for loc in locs:
    stripped = [url.replace(loc, '') for url in stripped]
After this, stripped is equal to
['example.com/page.html',
'example.com/page.html',
'example.com/page.html',
'example.com/page.html',
'example.com/page.html']
EDIT
Alternatively, if you don't need to keep the original list, you can rebind urls itself:
for loc in locs:
    urls = [url.replace(loc, '') for url in urls]
After this, urls is equal to
['example.com/page.html',
'example.com/page.html',
'example.com/page.html',
'example.com/page.html',
'example.com/page.html']
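One caveat with plain substring replacement: '/in' also occurs inside a path such as '/index.html', so real URLs may lose more than intended. A hedged sketch of a stricter variant follows. It assumes the URLs have no scheme (as in the toy data), that locale codes only appear as whole path segments, and that 'www.' and 'm.' only matter at the start of the host. The names normalize, LOCALE_SEGMENTS and HOST_PREFIXES are hypothetical.

```python
LOCALE_SEGMENTS = {'in', 'ca', 'de', 'fr'}  # assumed: whole path segments only
HOST_PREFIXES = ('www.', 'm.')              # assumed: only at the start of the host

def normalize(url):
    # Split 'host/seg1/seg2/...' into the host and its path segments.
    host, *segments = url.split('/')
    # Drop at most one known prefix from the front of the host.
    for prefix in HOST_PREFIXES:
        if host.startswith(prefix):
            host = host[len(prefix):]
            break
    # Keep every path segment that is not exactly a locale code.
    kept = [seg for seg in segments if seg not in LOCALE_SEGMENTS]
    return '/'.join([host] + kept)

print(normalize('m.example.com/de/page.html'))          # example.com/page.html
print(normalize('example.com/index.html'))              # example.com/index.html
print('example.com/index.html'.replace('/in', ''))      # example.comdex.html
```

The trade-off is that a page whose final segment is literally 'in' or 'de' would also be dropped, so the segment list should be checked against the real data.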
You could first abstract the removal into a function and then use a list comprehension:
def remove(target, strings):
    # Apply each replacement in turn, keeping the returned string.
    for s in strings:
        target = target.replace(s, '')
    return target
URLs = ['example.com/page.html',
'www.example.com/in/page.html',
'example.com/ca/fr/page.html',
'm.example.com/de/page.html',
'example.com/fr/page.html']
locs = ['/in', '/ca', '/de', '/fr', 'm.', 'www.']
Used like:
URLs = [remove(url, locs) for url in URLs]
for url in URLs:
    print(url)
Output:
example.com/page.html
example.com/page.html
example.com/page.html
example.com/page.html
example.com/page.html
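Since the stated goal is to find which URLs are variations of the same page, a possible follow-up is to group the original URLs by their cleaned form. This is only a sketch: it reuses the remove function from this answer, and the variants dictionary is a name made up for the example.

```python
from collections import defaultdict

def remove(target, strings):
    for s in strings:
        target = target.replace(s, '')
    return target

urls = ['example.com/page.html',
        'www.example.com/in/page.html',
        'example.com/ca/fr/page.html',
        'm.example.com/de/page.html',
        'example.com/fr/page.html']
locs = ['/in', '/ca', '/de', '/fr', 'm.', 'www.']

# Map each cleaned URL to the list of original URLs that reduce to it.
variants = defaultdict(list)
for url in urls:
    variants[remove(url, locs)].append(url)

for page, originals in variants.items():
    print(page, len(originals))  # example.com/page.html 5
```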