Python Remove List of Strings from List of Strings

Question

I'm trying to remove several strings from a list of URLs. I have more than 300k URLs, and I'm trying to find which are variations of the original. Here's a toy example that I've been working with.

URLs = ['example.com/page.html',
        'www.example.com/in/page.html',
        'example.com/ca/fr/page.html',
        'm.example.com/de/page.html',
        'example.com/fr/page.html']

locs = ['/in', '/ca', '/de', '/fr', 'm.', 'www.']

What I'd like to end up with is a list of the pages without the language or locations:

desired_output = ['example.com/page.html',
                  'example.com/page.html',
                  'example.com/page.html',
                  'example.com/page.html',
                  'example.com/page.html']

I've tried list comprehensions and nested for loops, but nothing has worked yet. Can anyone help?

import re

# doesn't remove anything
for item in URLs:
    for string in locs:
        re.sub(string, '', item)

# doesn't remove anything
for item in URLs:
    for string in locs:
        item.strip(string)

# only removes the last string in locs
clean = []
for item in URLs:
    for string in locs:
        new = item.replace(string, '')
    clean.append(new)

Answer 1

You have to assign the result of replace to item again:

clean = []
for item in URLs:
    for loc in locs:
        item = item.replace(loc, '')
    clean.append(item)

or in short (in Python 3, reduce has to be imported from functools):

from functools import reduce

clean = [
    reduce(lambda s, loc: s.replace(loc, ''), locs, item)
    for item in URLs
]
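
The assignment matters because Python strings are immutable: str.replace (like re.sub and str.strip) returns a new string and leaves the original untouched. A minimal sketch of the difference, where the names url, unchanged and cleaned are made up for illustration:

```python
url = 'www.example.com/in/page.html'

# The return value is discarded here, so url still holds the original text.
url.replace('www.', '')
unchanged = url

# Rebinding the name keeps each intermediate result.
for loc in ['/in', 'www.']:
    url = url.replace(loc, '')
cleaned = url

print(unchanged)  # www.example.com/in/page.html
print(cleaned)    # example.com/page.html
```

Note that str.strip would not be a substitute even with the assignment, since it treats its argument as a set of characters to trim from both ends rather than as a substring.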

Answer 2

The biggest problem is that you don't save the return value of replace.

urls = ['example.com/page.html',
        'www.example.com/in/page.html',
        'example.com/ca/fr/page.html',
        'm.example.com/de/page.html',
        'example.com/fr/page.html']

locs = ['/in', '/ca', '/de', '/fr', 'm.', 'www.']

stripped = list(urls)  # work on a copy so urls stays intact (optional)

for loc in locs:
    stripped = [url.replace(loc, '') for url in stripped]

After this, stripped is equal to

['example.com/page.html',
 'example.com/page.html',
 'example.com/page.html',
 'example.com/page.html',
 'example.com/page.html']

EDIT

Alternatively, if you don't need to keep the original list, you can rebind urls directly:

for loc in locs:
    urls = [url.replace(loc, '') for url in urls]

After this, urls is equal to

['example.com/page.html',
 'example.com/page.html',
 'example.com/page.html',
 'example.com/page.html',
 'example.com/page.html']
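
With 300k URLs it may also be worth doing all removals in one pass per URL. The following is only a sketch of that idea, not part of the answer above: it joins the strings in locs into a single regular expression, and the names pattern and stripped are chosen here for illustration. re.escape is needed because the '.' in 'm.' and 'www.' would otherwise match any character.

```python
import re

urls = ['example.com/page.html',
        'www.example.com/in/page.html',
        'example.com/ca/fr/page.html',
        'm.example.com/de/page.html',
        'example.com/fr/page.html']

locs = ['/in', '/ca', '/de', '/fr', 'm.', 'www.']

# One alternation of literal strings, e.g. /in|/ca|...|m\.|www\.
pattern = re.compile('|'.join(re.escape(loc) for loc in locs))

# Each URL is scanned once; every match is replaced by the empty string.
stripped = [pattern.sub('', url) for url in urls]

print(stripped)
```

For the toy data this gives the same five 'example.com/page.html' entries. Two caveats apply: a single pass does not re-scan text that only becomes a match after an earlier removal, and both this and str.replace remove the substrings wherever they occur, so '/in' would also be cut out of a path such as 'example.com/index.html'.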

Answer 3

You could first abstract the removing part into a function and then use a list comprehension:

def remove(target, strings):
    for s in strings:
        target = target.replace(s,'')
    return target

URLs = ['example.com/page.html',
        'www.example.com/in/page.html',
        'example.com/ca/fr/page.html',
        'm.example.com/de/page.html',
        'example.com/fr/page.html']

locs = ['/in', '/ca', '/de', '/fr', 'm.', 'www.']

Used like:

URLs = [remove(url,locs) for url in URLs]

for url in URLs:
    print(url)

Output:

example.com/page.html
example.com/page.html
example.com/page.html
example.com/page.html
example.com/page.html
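
Since the stated goal in the question is to find which URLs are variations of the same page, the cleaned string can also serve as a grouping key. A rough sketch, assuming the remove function from this answer; the variants dictionary is a hypothetical name introduced here:

```python
from collections import defaultdict

def remove(target, strings):
    # Strip every occurrence of each string from target.
    for s in strings:
        target = target.replace(s, '')
    return target

URLs = ['example.com/page.html',
        'www.example.com/in/page.html',
        'example.com/ca/fr/page.html',
        'm.example.com/de/page.html',
        'example.com/fr/page.html']

locs = ['/in', '/ca', '/de', '/fr', 'm.', 'www.']

# Map each cleaned URL to the list of original URLs that reduce to it.
variants = defaultdict(list)
for url in URLs:
    variants[remove(url, locs)].append(url)

for page, originals in variants.items():
    print(page, len(originals))
```

With the toy data, all five URLs end up under the single key 'example.com/page.html'.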

3 answers

solution1
4 ACCEPTED 2016-08-31 20:05:22

solution2
3 2016-08-31 20:08:43

solution3
2 2016-08-31 20:10:26
