
Python - How do I remove a string from a list if a substring is a match (regex)?

Apologies if this has been asked before, but I'm wrecking my head over it and I've googled for hours trying to find a similar solution.

I have a list of URLs in which the last 6 characters, enclosed in '/' '/', are digits, e.g. www.test.com/nothere/432432/

I'm trying to write the code so that if there is a match on that substring, in the position it occupies in the string, the URL doesn't get added to the list. The URLs I'm looking at all have the same format, hence the use of the regex in the example.

I've tried various combinations of if re.match, if re.search and so on, and nothing I can put together seems to work.

This is my latest attempt:

import re

list = ['www.test.com/nothere/432432/', 'www.test.com/nothere/685985/', 'www.test.com/nothere/655985/', 'www.test.com/nothere/112113/']

regex = re.compile(r'(/\d{6}/)')
filtered = [i for i in list if not regex.match(i)]
print(filtered)

My understanding is that if regex.match(i) does not match, the item gets added; otherwise it doesn't. But that is clearly not the case, and it adds them all regardless :/

Any and all help is appreciated.

Thanks!

EDIT

Another version I've tried, which does nothing:

            regex = re.match(r'(/\d{6}/)', Adlink) in allAdLinks
            if regex:
                allAdLinks.remove(Adlink)
                print(allAdLinks)
            else:
                print("try again")
                continue

IIUC, you want to remove all entries from your list where the final 6 digits have already been seen in another URL in the list. You can do that by making a set of the final 6 digits (the code below slices the last 7 characters, i.e. the 6 digits plus the trailing slash) and then processing the list, keeping each URL only if its last 6 digits are still in the set, and removing them from the set once found:

urls = [
 'www.test.com/nothere/432432/',
 'www.test.com/nothere/685985/',
 'www.test.com/nothere/655985/',
 'www.test.com/nothere/112113/',
 'www.test.com/another/685985/'
]
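# Collect every distinct trailing "dddddd/" suffix
pages = set(url[-7:] for url in urls)
result = []
for url in urls:
    # Keep the URL only the first time its suffix is encountered
    if url[-7:] in pages:
        result.append(url)
        # Removing the suffix makes later URLs with the same digits get skipped
        pages.remove(url[-7:])
print(result)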

Output:

[
 'www.test.com/nothere/432432/',
 'www.test.com/nothere/685985/',
 'www.test.com/nothere/655985/',
 'www.test.com/nothere/112113/'
]
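
Since the question was about doing this with a regex, here is a minimal sketch of the same de-duplication using re.search instead of slicing. Note that re.match only tests at the start of the string, so a pattern like (/\d{6}/) can never match a URL that begins with www, which would explain why the list comprehension in the question kept every item. The names ID_PATTERN and dedupe_by_id are made up for this example, and keeping URLs that have no trailing 6-digit ID is an assumption, not something the question specifies.

```python
import re

# Hypothetical pattern: six digits enclosed in slashes at the very end of the URL.
ID_PATTERN = re.compile(r'/(\d{6})/$')


def dedupe_by_id(urls):
    """Keep only the first URL seen for each trailing six-digit ID."""
    seen = set()
    result = []
    for url in urls:
        # search() scans the whole string; match() would only test the start.
        found = ID_PATTERN.search(url)
        if found is None:
            # Assumption: URLs without a trailing ID are kept unchanged.
            result.append(url)
            continue
        page_id = found.group(1)
        if page_id not in seen:
            seen.add(page_id)
            result.append(url)
    return result


urls = [
    'www.test.com/nothere/432432/',
    'www.test.com/nothere/685985/',
    'www.test.com/nothere/655985/',
    'www.test.com/nothere/112113/',
    'www.test.com/another/685985/',
]
print(dedupe_by_id(urls))
```

With the sample list above, this should keep the same four URLs as the slicing version, dropping www.test.com/another/685985/ because 685985 was already seen.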
