
Python Remove List of Strings from List of Strings

I'm trying to remove several strings from a list of URLs. I have more than 300k URLs, and I'm trying to find which are variations of the original. Here's a toy example that I've been working with.

URLs = ['example.com/page.html',
        'www.example.com/in/page.html',
        'example.com/ca/fr/page.html',
        'm.example.com/de/page.html',
        'example.com/fr/page.html']

locs = ['/in', '/ca', '/de', '/fr', 'm.', 'www.']

What I'd like to end up with is a list of the pages without the language or locations:

desired_output = ['example.com/page.html',
                  'example.com/page.html',
                  'example.com/page.html',
                  'example.com/page.html',
                  'example.com/page.html']

I've tried list comprehensions and nested for loops, but nothing has worked yet. Can anyone help?

# doesn't remove anything
for item in URLs:
    for string in locs:
        re.sub(string, '', item)

# doesn't remove anything
for item in URLs:
    for string in locs:
        item.strip(string)

# only removes the last string in locs
clean = []
for item in URLs:
    for string in locs:
        new = item.replace(string, '')
    clean.append(new)

str.replace returns a new string rather than changing item in place, so you have to assign its result back to item:

clean = []
for item in URLs:
    for loc in locs:
        item = item.replace(loc, '')
    clean.append(item)

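or, more compactly, with functools.reduce:

from functools import reduce  # reduce is not a builtin in Python 3

clean = [
    # start from item and apply one replace per entry in locs
    reduce(lambda s, loc: s.replace(loc, ''), locs, item)
    for item in URLs
]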
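To see why the attempts in the question leave the list unchanged, here is a small check. It is only a sketch on one of the toy URLs; the variable names (unchanged, stripped, fixed) are made up for illustration.

```python
import re

url = 'www.example.com/in/page.html'

# str.replace and re.sub return a new string; the original is untouched
# unless the result is assigned to a name.
url.replace('www.', '')
re.sub('/in', '', url)
unchanged = url

# str.strip treats its argument as a set of characters to trim from both
# ends, not as a substring to delete.
stripped = 'm.example.com'.strip('m.')

# Assigning the result is what actually keeps the change.
fixed = url.replace('www.', '').replace('/in', '')

print(unchanged)  # www.example.com/in/page.html
print(stripped)   # example.co
print(fixed)      # example.com/page.html
```

Note that strip also ate the trailing "m" of ".com", which is why it is the wrong tool here even when its result is saved.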

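The biggest problem is that you never save the return value. Python strings are immutable, so replace, strip and re.sub all return a new string and leave the original unchanged.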

urls = ['example.com/page.html',
        'www.example.com/in/page.html',
        'example.com/ca/fr/page.html',
        'm.example.com/de/page.html',
        'example.com/fr/page.html']

locs = ['/in', '/ca', '/de', '/fr', 'm.', 'www.']

stripped = list(urls)  # work on a copy so urls stays intact (optional)

for loc in locs:
    stripped = [url.replace(loc, '') for url in stripped]

After this, stripped is equal to

['example.com/page.html',
 'example.com/page.html',
 'example.com/page.html',
 'example.com/page.html',
 'example.com/page.html']

EDIT

Alternatively, without creating a new list, you can do

for loc in locs:
    urls = [url.replace(loc, '') for url in urls]

After this, urls is equal to

['example.com/page.html',
 'example.com/page.html',
 'example.com/page.html',
 'example.com/page.html',
 'example.com/page.html']
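With 300k URLs, the loop above rebuilds the whole list once per entry in locs. A possible alternative, not part of the original answers, is to combine the strings into one compiled regular expression and make a single pass. This is a sketch under the assumption that removing all matches in one pass is acceptable; unlike repeated replace calls, it will not catch a match that only appears after an earlier removal. The names pattern and cleaned are illustrative.

```python
import re

urls = ['example.com/page.html',
        'www.example.com/in/page.html',
        'example.com/ca/fr/page.html',
        'm.example.com/de/page.html',
        'example.com/fr/page.html']

locs = ['/in', '/ca', '/de', '/fr', 'm.', 'www.']

# re.escape makes '.' match a literal dot instead of any character.
# Longer strings go first so a short entry cannot shadow a longer one
# that starts the same way.
alternatives = sorted(locs, key=len, reverse=True)
pattern = re.compile('|'.join(re.escape(loc) for loc in alternatives))

# One substitution call per URL removes every listed string.
cleaned = [pattern.sub('', url) for url in urls]

for url in cleaned:
    print(url)
```

This also shows why the re.sub attempt in the question needs care: without re.escape, the pattern 'm.' would match "m" followed by any character.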

You could first abstract the removing part into a function and then use a list comprehension:

def remove(target, strings):
    for s in strings:
        target = target.replace(s,'')
    return target

URLs = ['example.com/page.html',
        'www.example.com/in/page.html',
        'example.com/ca/fr/page.html',
        'm.example.com/de/page.html',
        'example.com/fr/page.html']

locs = ['/in', '/ca', '/de', '/fr', 'm.', 'www.']

Used like:

URLs = [remove(url,locs) for url in URLs]

for url in URLs:
    print(url)

output:

example.com/page.html
example.com/page.html
example.com/page.html
example.com/page.html
example.com/page.html
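One caveat applies to every replace-based approach above: the strings are removed wherever they occur, so '/in' also matches the start of '/index.html'. If that matters for the real data, a stricter variant could remove only leading host prefixes and whole path segments. The helper below is a hypothetical sketch, not one of the original answers; the name normalize and its parameters are invented here, and it assumes URLs without a scheme, as in the toy example.

```python
def normalize(url, subdomains=('www.', 'm.'), segments=('in', 'ca', 'de', 'fr')):
    """Drop a known host prefix and any path segment that is a location code."""
    host, sep, path = url.partition('/')
    # Remove a prefix only at the very start of the host name.
    for prefix in subdomains:
        if host.startswith(prefix):
            host = host[len(prefix):]
            break
    if not sep:
        return host  # no path at all, nothing more to do
    # Keep every path segment that is not exactly a location code.
    kept = [part for part in path.split('/') if part not in segments]
    return '/'.join([host] + kept)


# Plain substring replacement damages an unrelated path.
naive = 'example.com/index.html'.replace('/in', '')
print(naive)                                      # example.comdex.html
print(normalize('example.com/index.html'))        # example.com/index.html
print(normalize('www.example.com/in/page.html'))  # example.com/page.html
```

Used like the remove function above: URLs = [normalize(url) for url in URLs].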
