Python - partitioning a list of strings using an equivalence relation

I have a list of alphabetic strings [str1,str2,...] which I need to partition into equivalence classes using an equivalence relation R, where str1 R str2 (in relational notation) if str2 can be obtained from str1 by a sequence of valid one-letter changes. Here 'valid' means the change produces a valid alphabetic word, e.g. cat --> car is valid but cat --> cax is not. If the input list were ['cat','ace','car','zip','ape','pip'], the code should return [['cat','car'],['ace','ape'],['zip','pip']].

I've got an initial working version which, however, produces some "classes" which contain duplicates.

I don't suppose there is any Python package which allows me to define such equivalence relations, but even otherwise what would be the best way of doing this?

It should work for strings of different lengths. Obviously, ordering matters.

def is_one_letter_different(s1, s2):
    if len(s1) != len(s2):
        return False
    diff_count = 0
    for char1, char2 in zip(s1, s2):
        if char1 != char2:
            diff_count += 1
    return diff_count == 1

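def group(candidates):
    groups = []
    for candidate in candidates:
        # Find every existing group holding a word one letter away from the candidate.
        matches = [g for g in groups
                   if any(is_one_letter_different(word, candidate) for word in g)]
        if not matches:
            groups.append([candidate])
            continue
        # The candidate may bridge several groups (e.g. 'cog' links 'cot' and 'dog'),
        # so merge all of them into the first match before adding the candidate.
        merged = matches[0]
        for other in matches[1:]:
            merged.extend(other)
            groups.remove(other)
        merged.append(candidate)
    return groups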

print(group(['bread','breed', 'bream', 'tread', 'treat', 'short', 'shorn', 'shirt', 'shore', 'store','eagle','mired', 'sired', 'hired']))

Output:

[['bread', 'breed', 'bream', 'tread', 'treat'], ['short', 'shorn', 'shirt', 'shore', 'store'], ['eagle'], ['mired', 'sired', 'hired']]

EDIT: Updated to handle additional test cases. I'm not sure the output is correct, so please validate it. And please provide good test cases next time.
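Since the output above still needs validating, here is a minimal sketch of a checker for any proposed grouping. The names `one_letter_apart` and `check_partition` are made up for this illustration and are not part of the code above. It assumes the relation is "same length, exactly one differing position" and checks three things: the groups contain exactly the input words, no two words in different groups are one letter apart, and every group is connected.

```python
from itertools import combinations

def one_letter_apart(a, b):
    """True if a and b have the same length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def check_partition(words, groups, related=one_letter_apart):
    """Return a list of problems found; an empty list means no problem was detected."""
    problems = []
    flat = [w for g in groups for w in g]
    if sorted(flat) != sorted(words):
        problems.append("groups do not contain exactly the input words")
    # Words in different groups must never be directly related.
    for g1, g2 in combinations(groups, 2):
        for a in g1:
            for b in g2:
                if related(a, b):
                    problems.append("%r and %r are related but in different groups" % (a, b))
    # Every word in a group must be reachable from the group's first word.
    for g in groups:
        if not g:
            continue
        reached = {g[0]}
        frontier = [g[0]]
        while frontier:
            cur = frontier.pop()
            for w in g:
                if w not in reached and related(cur, w):
                    reached.add(w)
                    frontier.append(w)
        if len(reached) != len(set(g)):
            problems.append("group %r is not connected" % (g,))
    return problems

# A grouping that leaves 'dog' apart from 'cog' is reported as a problem.
print(check_partition(['cat', 'cot', 'dog', 'cog'], [['cat', 'cot', 'cog'], ['dog']]))
```

Running it on the fourteen-word test list and the four groups shown in the output above should return an empty list. Test cases where a later word links two groups that were already formed are worth including.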

I would do it something like this: construct an undirected graph where each word is a node and each edge connects two words for which the one-letter relation holds. Then identify each connected component ("island") in the graph; each island is one equivalence class.

from collections import defaultdict

def exactly_one(iterable):
    # True if exactly one item is truthy; stops early once a second one is found.
    count = 0
    for x in iterable:
        if x:
            count += 1
            if count > 1:
                break
    return count == 1

def are_one_letter_apart(a,b):
    if len(a) != len(b): return False
    return exactly_one(a_char != b_char for a_char, b_char in zip(a,b))

def pairs(seq):
    for i in range(len(seq)):
        for j in range(i+1, len(seq)):
            yield (seq[i], seq[j])

def search(graph, node):
    seen = set()
    to_visit = set()
    to_visit.add(node)
    while to_visit:
        cur = to_visit.pop()
        if cur in seen: continue
        for neighbor in graph[cur]:
            if neighbor not in seen:
                to_visit.add(neighbor)
        seen.add(cur)
    return seen

def get_islands(graph):
    seen = set()
    islands = []
    for item in graph:  # iterating a dict yields its keys (works in Python 2 and 3)
        if item in seen: continue
        group = search(graph, item)
        seen = seen | group
        islands.append(group)
    return islands

def create_classes(seq, f):
    graph = defaultdict(list)
    for a,b in pairs(seq):
        if f(a,b):
            graph[a].append(b)
            graph[b].append(a)
    #one last pass to pick up items with no relations to anything else
    for item in seq:
        if item not in graph:
            graph[item].append(item)

    return [list(group) for group in get_islands(graph)]

seq = ['cat','ace','car','zip','ape','pip']
print(create_classes(seq, are_one_letter_apart))

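Result (the order of the classes, and of the words within each class, may vary between runs and Python versions because sets are unordered):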

[['ace', 'ape'], ['pip', 'zip'], ['car', 'cat']]
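The same island-finding idea can also be sketched with a disjoint-set (union-find) structure instead of an explicit graph search. This is a different technique from the code above, shown only as an alternative. The names `partition` and `one_letter_apart` are hypothetical, and the sketch assumes Python 3 and hashable items.

```python
def one_letter_apart(a, b):
    """True if a and b have the same length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def partition(words, related):
    """Group words into the classes generated by the pairwise test `related`."""
    parent = {w: w for w in words}  # each word starts as its own representative

    def find(w):
        # Follow parent links to the representative, shortening the path on the way.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    unique = list(parent)  # duplicates in the input collapse to one entry
    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            if related(a, b):
                parent[find(a)] = find(b)  # union the two classes

    classes = {}
    for w in unique:
        classes.setdefault(find(w), []).append(w)
    return list(classes.values())

seq = ['cat', 'ace', 'car', 'zip', 'ape', 'pip']
print(sorted(sorted(c) for c in partition(seq, one_letter_apart)))
```

Like the graph version, this still compares every pair of words, so it is quadratic in the number of words. The union-find part only replaces the island search.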
