
How to remove similar words from a list of words?

list1=['Usha', 'Das', 'Anas', 'Chand', 'Tokyo', 'Milkha Singh', 'Gurbachan Singh Randhawa', 'PT Usha', 'KM Beenamol', 'Hima Das', 'under-20', 'Muhammed Anas', 'Dutee Chand', 'the Asian Games', 'Asian Games', 'Olympic Games']

From the list above you can see that 'Das' and 'Hima Das' both appear. I want to keep only the full name, that is 'Hima Das'. The same applies to 'Usha' and 'PT Usha'.

The output I need:

['Tokyo', 'Milkha Singh', 'Gurbachan Singh Randhawa', 'PT Usha', 'KM Beenamol', 'Hima Das', 'under-20', 'Muhammed Anas', 'Dutee Chand', 'Asian Games', 'Olympic Games']

Perhaps a list comprehension with any():

print([i for i in list1 if not any(i in x and i!=x for x in list1)])
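The substring test above compares characters, so a short name would also be dropped when it merely occurs inside a longer, unrelated string. As a hedged alternative, the sketch below compares whole words instead. The function name `drop_contained` and the sample name 'Dasgupta' are illustrative assumptions, not part of the original answer.

```python
def drop_contained(names):
    """Keep names whose set of words is not a proper subset of another name's words."""
    word_sets = [set(name.split()) for name in names]
    return [
        name
        for name, words in zip(names, word_sets)
        # `words < other` is True only for a proper subset, so a name never excludes itself
        if not any(words < other for other in word_sets)
    ]


sample = ['Das', 'Hima Das', 'Dasgupta', 'Tokyo']
print(drop_contained(sample))  # ['Hima Das', 'Dasgupta', 'Tokyo']
```

Like the one-liner, this is still a pairwise comparison, so it is quadratic in the number of names. It also keeps 'the Asian Games' and drops 'Asian Games', since the shorter entry's words are all contained in the longer one.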

Create a set of the individual words from every element in the list that has more than one word.

Then use a list comprehension to filter out the elements that appear in that set.

This solution is O(n) in the total number of words (assuming average constant-time set lookups), which is the best you can do efficiency-wise, and better than checking each element against the whole list, which is O(n^2).

parts = {w for e in list1 if ' ' in e for w in e.split()}
out = [e for e in list1 if e not in parts]
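On the question's list, the two lines above keep both 'the Asian Games' and 'Asian Games', because multi-word entries are never filtered. A minimal sketch that reproduces the requested output exactly is shown below. It assumes that a leading "the " can be treated as noise and stripped before deduplicating; that rule and the function name `remove_partial_names` are assumptions added here, not something the question states.

```python
def remove_partial_names(names):
    """Drop single words that also occur inside a multi-word entry."""
    seen = set()
    cleaned = []
    for name in names:
        # Assumption: a leading article is not part of the name
        if name.lower().startswith("the "):
            name = name[4:]
        # Skip exact duplicates while preserving the original order
        if name not in seen:
            seen.add(name)
            cleaned.append(name)
    # Words that belong to some multi-word entry
    parts = {w for e in cleaned if " " in e for w in e.split()}
    return [e for e in cleaned if e not in parts]


list1 = ['Usha', 'Das', 'Anas', 'Chand', 'Tokyo', 'Milkha Singh',
         'Gurbachan Singh Randhawa', 'PT Usha', 'KM Beenamol', 'Hima Das',
         'under-20', 'Muhammed Anas', 'Dutee Chand', 'the Asian Games',
         'Asian Games', 'Olympic Games']
print(remove_partial_names(list1))
```

Every step is a single pass with set lookups, so the linear behaviour of the original two-liner is preserved.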

I have solved a similar problem by using the FuzzyWuzzy library. It will return words that are similar to other items in your list, based on a number of factors.

from fuzzywuzzy import process

all_names=['Usha', 'Das', 'Anas', 'Chand', 'Tokyo', 'Milkha Singh', 'Gurbachan Singh Randhawa', 'PT Usha', 'KM Beenamol', 'Hima Das', 'under-20', 'Muhammed Anas', 'Dutee Chand', 'the Asian Games', 'Asian Games', 'Olympic Games']

for name in all_names:
    # Each match is a (candidate, score) tuple, best score first
    matches = process.extractBests(name, all_names)

From here you can find the longest match in the matches list and treat it as your "candidate" match. For example, "Das" will match "Hima Das" to some degree, so both will be returned, and based on length you will choose "Hima Das".

Then add the candidate matches to a set to ensure they are unique.
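The "longest match, then a set" idea can be sketched as follows. To keep the example dependency-free it swaps in the standard library's `difflib.SequenceMatcher` as a stand-in scorer instead of FuzzyWuzzy, so the scores will differ from what FuzzyWuzzy reports. The helper names and the `cutoff` value of 0.5 are hypothetical choices for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity score in [0, 1]; not FuzzyWuzzy's scoring."""
    return SequenceMatcher(None, a, b).ratio()


def longest_candidates(names, cutoff=0.5):
    """For each name, pick the longest sufficiently similar entry; return the unique picks."""
    candidates = set()
    for name in names:
        # A name always matches itself (score 1.0), so `matches` is never empty
        matches = [other for other in names if similarity(name, other) >= cutoff]
        candidates.add(max(matches, key=len))
    return candidates


print(sorted(longest_candidates(['Das', 'Hima Das', 'Tokyo'])))  # ['Hima Das', 'Tokyo']
```

Fuzzy scores can also pair up entries that merely share a word, such as two different "... Games" names, so the cutoff would need tuning against real data before relying on this approach.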

list1=['Usha', 'Das', 'Anas', 'Chand', 'Tokyo', 'Milkha Singh',
     'Gurbachan Singh Randhawa', 'PT Usha', 'KM Beenamol', 'Hima Das', 'under-20', 
      'Muhammed Anas', 'Dutee Chand', 'the Asian Games', 'Asian Games', 'Olympic Games']

new_list = [value for value in list1 if not any(value in value2 for value2 in list1 if value2 != value)]

Using a for loop:

list1=['Usha', 'Das', 'Anas', 'Chand', 'Tokyo', 'Milkha Singh', 'Gurbachan Singh Randhawa', 'PT Usha', 'KM Beenamol', 'Hima Das', 'under-20', 'Muhammed Anas', 'Dutee Chand', 'the Asian Games', 'Asian Games', 'Olympic Games']

uniques = []

for i in list1:
    # Keep i only if it is not contained in some other, longer entry
    if not any(i in x and i != x for x in list1):
        uniques.append(i)
print(uniques)

Using a list comprehension:

uniques = [i for i in list1 if not any(i in x and i != x for x in list1)]
