
Python removing similar urls from a list

I am fairly new to Python, but I would like your help solving a minor issue I am having with removing similar duplicates from a list.

So I have a list of urls: myList = ['http://www.mywebsite.com/shoes', 'http://wwww.yourwebsite.com/', 'http://www.mywebsite.com/shoes/']

I want to remove similar URLs. As you can see, http://www.mywebsite.com/shoes and http://www.mywebsite.com/shoes/ are pretty much the same. I would like to remove one of them (I don't care which one) but keep the other, essentially removing the duplicate from the list. I would give an example of what I have tried, but I don't even know where to begin.

Any insight would be a great help.

If by "similar" you mean that the URLs differ only by a trailing '/', then you can use a set to remove the duplicates from your list, since:

A set object is an unordered collection of distinct hashable objects. Common uses include membership testing, removing duplicates from a sequence, and computing mathematical operations such as intersection, union, difference, and symmetric difference.

myList = ['http://www.mywebsite.com/shoes', 'http://wwww.yourwebsite.com/', 'http://www.mywebsite.com/shoes/']

set(x.rstrip('/') for x in myList)  # returns a set of unique URLs, trailing '/' removed

# In case you need a list
myList = list(set(x.rstrip('/') for x in myList))
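Because a set is unordered, the list built from it may come back in a different order than the original. If the order matters, one possible sketch is to track which stripped URLs have already been seen. The helper name `dedupe_trailing_slash` is made up for this illustration:

```python
def dedupe_trailing_slash(urls):
    """Drop URLs that differ only by trailing slashes, keeping first occurrences in order."""
    seen = set()
    result = []
    for url in urls:
        key = url.rstrip('/')  # compare URLs with trailing slashes removed
        if key not in seen:
            seen.add(key)
            result.append(url)  # keep the URL exactly as it first appeared
    return result


myList = ['http://www.mywebsite.com/shoes', 'http://wwww.yourwebsite.com/', 'http://www.mywebsite.com/shoes/']
print(dedupe_trailing_slash(myList))
# ['http://www.mywebsite.com/shoes', 'http://wwww.yourwebsite.com/']
```

Unlike the set-based one-liner, this keeps whichever spelling of the URL came first instead of always returning the stripped form.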

You could do:

  1. First, remove the trailing slash from each URL.
  2. Then remove duplicates:

set(map(lambda url: url.rstrip('/'), myList))

The issue may be that you haven't figured out exactly what it means for two URLs to be similar. We can't help you with that because only you know what your requirements are. Once you do figure that out, though, the rest is simple enough. There are two ways to do it:

  • If your similarity relation is transitive - that is, if similar(a,b) and similar(b,c) together imply similar(a,c) for all URLs a, b, c - then it will be possible to convert each URL to a canonical form. Two URLs will be similar if and only if their canonical forms are equal. So the easiest thing to do in that case is to convert each URL to canonical form, then create a set out of the canonical URLs obtained in this way:

     set(canonical(u) for u in myList)
  • If your similarity relation is not transitive, then things get really tricky because you can have cases like A being similar to B and B being similar to C, but A is not similar to C. So then the question becomes, in this example, what would you like to include in your list with duplicates stripped? Would you include A and C because they are not similar to each other, or would you include only B because you'd consider both A and C similar duplicates of it? In this case, depending on how you want to handle "fuzzy" cases like this, there are various algorithms you can use - but again, we'd need to know your exact requirements to recommend anything.
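As a rough illustration of the two cases above: `canonical` below is one possible choice of canonical form (lowercase scheme and host, no trailing slash on the path), and `greedy_dedupe` is one simple way to handle a non-transitive relation, keeping a URL only if it is not similar to any URL already kept. The rules inside `canonical`, the `similar` function, and the 0.9 threshold are all assumptions made for this example; the question does not specify any of them.

```python
from difflib import SequenceMatcher
from urllib.parse import urlsplit, urlunsplit


def canonical(url):
    """Hypothetical canonical form: lowercase scheme and host, no trailing slash on the path."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),     # assumption: the whole netloc can be lowercased
        parts.path.rstrip('/'),   # assumption: trailing slashes are insignificant
        parts.query,
        parts.fragment,
    ))


def similar(a, b, threshold=0.9):
    """Hypothetical fuzzy test based on string similarity; not transitive in general."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def greedy_dedupe(urls, is_similar=similar):
    """Keep a URL only if it is not similar to any URL kept so far."""
    kept = []
    for url in urls:
        if not any(is_similar(url, k) for k in kept):
            kept.append(url)
    return kept


myList = ['http://www.mywebsite.com/shoes', 'http://wwww.yourwebsite.com/', 'http://www.mywebsite.com/shoes/']

# Transitive case: equal canonical forms mean "similar".
print(set(canonical(u) for u in myList))

# Non-transitive case: the result can depend on the order of the input.
print(greedy_dedupe(myList))
```

With the greedy approach, a chain where A is similar to B and B is similar to C, but A is not similar to C, yields A and C if A comes first and only B if B comes first, so it answers the "which ones do I keep?" question by input order rather than by any deeper rule.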
