Python: Grouping similar words

I'm new to Python and have run into a problem with this task: given a list of words, group them according to the following rule: two words are similar if word2 can be formed using only letters that appear in word1, and vice versa; otherwise they are not.

For example:

word_list = ['arts', 'rats', 'star', 'tars', 'start', 'pat', 'allergy', 'lager', 'largely', 'regally', 'apt',
             'potters', 'tap', 'bluest', 'tap', 'bluets', 'retraced', 'gallery', 'bustle', 'sublet', 'subtle', 'grab']

output = ['arts', 'rats', 'star', 'tars', 'start'], [..., ....]

I have been stuck for hours. How should I tackle this?

collections.defaultdict and frozenset lend themselves to an elegant solution (a plain set can't be used as the dictionary key, since it is mutable and therefore unhashable):

>>> import collections
>>> word_list = ['arts', 'rats', 'star', 'tars', 'start', 'pat', 'allergy', 'lager', 'largely', 'regally', 'apt',
...              'potters', 'tap', 'bluest', 'tap', 'bluets', 'retraced', 'gallery', 'bustle', 'sublet', 'subtle', 'grab']
>>> groups = collections.defaultdict(set)
>>> for word in word_list:
...     groups[frozenset(word)].add(word)
...
>>> print(groups)
defaultdict(<class 'set'>,
    {
        frozenset({'t', 'a', 's', 'r'}): {'rats', 'start', 'star', 'arts', 'tars'},
        frozenset({'t', 'p', 'a'}): {'pat', 'apt', 'tap'},
        frozenset({'g', 'e', 'y', 'l', 'r', 'a'}): {'allergy', 'gallery', 'largely', 'regally'},
        frozenset({'g', 'e', 'l', 'r', 'a'}): {'lager'},
        frozenset({'o', 'e', 's', 't', 'p', 'r'}): {'potters'},
        frozenset({'b', 'e', 'u', 's', 'l', 't'}): {'sublet', 'subtle', 'bluets', 'bustle', 'bluest'},
        frozenset({'e', 'd', 'c', 't', 'r', 'a'}): {'retraced'},
        frozenset({'g', 'b', 'r', 'a'}): {'grab'},
    })
>>>
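
To see why the key has to be a frozenset, and to turn the grouped sets into the list-of-lists shape the question asks for, here is a minimal sketch. The helper name groups_as_lists is made up for illustration and is not part of the answer above; sorting is added only because sets have no guaranteed order.

```python
import collections


def groups_as_lists(words):
    """Group words by their set of letters; return sorted lists (hypothetical helper)."""
    groups = collections.defaultdict(set)
    for word in words:
        # frozenset is immutable and hashable, so it can be a dict key.
        groups[frozenset(word)].add(word)
    # Sets are unordered, so sort each group (and the groups) for stable output.
    return sorted(sorted(group) for group in groups.values())


# A plain set is mutable and therefore unhashable: it cannot be a dict key.
try:
    {set("rats"): "rats"}
    set_is_hashable = True
except TypeError:
    set_is_hashable = False

result = groups_as_lists(["arts", "start", "pat", "tap", "tap", "grab"])
print(set_is_hashable)  # False
print(result)           # [['arts', 'start'], ['grab'], ['pat', 'tap']]
```

Note that storing each group as a set drops the duplicate 'tap'; use a list instead if duplicates in the input should be kept.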

You can try:

word_list = ['arts', 'rats', 'star', 'tars', 'start', 'pat', 'allergy', 'lager', 'largely', 'regally', 'apt',
             'potters', 'tap', 'bluest', 'tap', 'bluets', 'retraced', 'gallery', 'bustle', 'sublet', 'subtle', 'grab']

output = {}

for word in word_list:
    # Key: the word's unique letters in sorted order, e.g. 'start' -> 'arst'.
    # Deduplicate first, then sort; a set has no guaranteed iteration order.
    key = "".join(sorted(set(word)))
    if key not in output:
        output[key] = []
    output[key].append(word)

print(list(output.values()))

Output:

[['arts', 'rats', 'star', 'tars', 'start'], ['pat', 'apt', 'tap', 'tap'],
 ['allergy', 'largely', 'regally', 'gallery'], ['lager'], ['potters'],
 ['bluest', 'bluets', 'bustle', 'sublet', 'subtle'], ['retraced'], ['grab']]

Since the intent of the platform is learning by problem-solving, I will try to help by describing a simple approach that you can adopt rather than giving you ready-made code.

Your problem looks like grouping anagrams, with one caveat: two words can belong to the same group even when their character frequencies do NOT match.
For example, you have grouped rats and start together because both use the same set of characters. Hence your problem reduces to finding which words have the same character composition.
There are a variety of ways to proceed from here. I will describe one algorithm:

loop the list 0..(N-1):
  use char-composition as key and push the entry to a list of the respective bucket

char-composition: all the unique characters in the word, sorted. For example, rats gives arst as its key, and so does start. All the relevant words therefore end up in the same bucket, and you can then just print the list for each bucket.
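
One possible way to write the bucket approach down in Python is sketched below. The function names char_composition and group_similar are illustrative choices, not something the answer prescribes, and the sketch assumes plain lowercase words as in the question.

```python
from collections import defaultdict


def char_composition(word):
    """Return the word's unique characters, sorted, as a string key."""
    return "".join(sorted(set(word)))


def group_similar(words):
    """Bucket words by char-composition, keeping first-seen order."""
    buckets = defaultdict(list)
    for word in words:
        buckets[char_composition(word)].append(word)
    return list(buckets.values())


print(char_composition("rats"))   # arst
print(char_composition("start"))  # arst
print(group_similar(["rats", "start", "lager", "pat", "apt"]))
# [['rats', 'start'], ['lager'], ['pat', 'apt']]
```

Using a list for each bucket keeps duplicates and input order; switching to a set would drop repeated words.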

