
Struggling to write a function that finds 'clusters' of the same data in a list

I am struggling to write a Python function that finds the indices of 'clusters' of the same data within a list. I want it to return a dictionary with keys as the repeating data and values as a list containing the start and end index of each cluster. NOTE: If there are multiple clusters with the same data, I would like a 2D list as the value for that key. To give an example, say I have the list [1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 3, 3, 3]. My function find_clusters(x) should take the list as input and return the following dictionary: {1: [[0, 5], [9, 11]], 2: [5, 9], 3: [11, 14]}

Before worrying about multiple clusters of the same data, I tried to code a function that could handle single clusters, but it's stuck in an infinite loop:

def find_clusters(x):
    cluster_dict = {}
    start_ind = 0
    end_ind = 0
    while end_ind < len(x):
        start_ind = end_ind
        current_data = x[start_ind]
        while x[end_ind] == current_data:
            if end_ind + 1 == len(x):
                break
            else:
                end_ind += 1
        cluster_dict[current_data] = [start_ind, end_ind]

    return cluster_dict

There's no need for (nested) while loops here - you just need to iterate over the list once, keep track of the last value you've seen, and mark the end of a cluster whenever you see a different value. Since you want to return a dict of lists, you can use a defaultdict to store the indices.

from collections import defaultdict

def find_clusters(xs: list):
    clusters = defaultdict(list)
    if not xs:  # an empty list has no clusters, and xs[0] would raise IndexError
        return clusters
    current_value = xs[0]
    start_idx = 0
    for i, value in enumerate(xs):
        if value != current_value:
            clusters[current_value].append((start_idx, i))
            current_value = value
            start_idx = i
    # Handle final cluster after the loop completes
    clusters[current_value].append((start_idx, len(xs)))
    return clusters

>>> xs = [1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 3, 3, 3]
>>> find_clusters(xs)
defaultdict(<class 'list'>, {1: [(0, 5), (9, 11)], 2: [(5, 9)], 3: [(11, 14)]})
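Note that this always returns a list of clusters per key, whereas the question asks for a flat [start, end] when a value has only one cluster. If that mixed shape is really needed, a small post-processing step could produce it. The helper below is only a sketch; the name flatten_single_clusters is made up for illustration, and the input is written out as a literal dict so the snippet runs on its own:

```python
def flatten_single_clusters(clusters):
    """Return a plain dict where a value with exactly one cluster maps to a
    flat [start, end] list, and a value with several clusters maps to a
    list of [start, end] lists."""
    return {
        value: list(spans[0]) if len(spans) == 1 else [list(span) for span in spans]
        for value, spans in clusters.items()
    }

# Same content as the result shown above, written as a plain dict.
clusters = {1: [(0, 5), (9, 11)], 2: [(5, 9)], 3: [(11, 14)]}
shaped = flatten_single_clusters(clusters)
print(shaped)
```

Keeping a uniform list of clusters for every key is usually easier to work with, because the calling code does not have to check which shape it received.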

Side note: In your version, the second index of each cluster corresponds to the first occurrence of a different value, not the last occurrence of the current value. This is useful for slicing, but I find it more intuitive to store the last occurrence of the current value and then use xs[start_idx:end_idx+1] when slicing:

defaultdict(<class 'list'>, {1: [(0, 4), (9, 10)], 2: [(5, 8)], 3: [(11, 13)]})

To achieve this, just append (start_idx, i-1) instead of (start_idx, i) (and len(xs)-1 instead of len(xs) at the end).
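Written out in full, the inclusive-end variant could look like the sketch below. The function name find_clusters_inclusive and the empty-list guard are additions for illustration, not part of the code above:

```python
from collections import defaultdict

def find_clusters_inclusive(xs):
    """Map each value to a list of (first_idx, last_idx) pairs, both inclusive."""
    clusters = defaultdict(list)
    if not xs:  # assumption: an empty list simply yields no clusters
        return clusters
    current_value = xs[0]
    start_idx = 0
    for i, value in enumerate(xs):
        if value != current_value:
            # i is the first index of the new value, so the old run ended at i - 1
            clusters[current_value].append((start_idx, i - 1))
            current_value = value
            start_idx = i
    # Handle the final cluster, which ends at the last index of the list
    clusters[current_value].append((start_idx, len(xs) - 1))
    return clusters

xs = [1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 3, 3, 3]
result = find_clusters_inclusive(xs)

# With inclusive end indices, slicing needs end + 1
start, end = result[1][1]
second_run = xs[start:end + 1]
```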

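When the condition end_ind + 1 == len(x) is True, you break out of the inner loop without incrementing end_ind, so the outer condition end_ind < len(x) never becomes False and you get stuck. Try: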

from collections import defaultdict

def find_clusters(x):
    cluster_dict = defaultdict(list)
    start_ind = 0
    end_ind = 0
    while end_ind < len(x):
        start_ind = end_ind
        while x[end_ind] == x[start_ind]:
            # note that we always increment end_ind here, before checking the bound
            end_ind += 1
            if end_ind == len(x):
                break
        cluster_dict[x[start_ind]].append([start_ind, end_ind])
    return cluster_dict

arr = [1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 3, 3, 3]
find_clusters(arr)

Output:

defaultdict(list, {1: [[0, 5], [9, 11]], 2: [[5, 9]], 3: [[11, 14]]})
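As a different technique from the loops above, the standard library's itertools.groupby can do the run detection, since it groups consecutive equal items. This is only a sketch; the function name find_clusters_groupby is made up for illustration, and it returns the same half-open [start, end] pairs as the output above:

```python
from collections import defaultdict
from itertools import groupby

def find_clusters_groupby(xs):
    """Map each value to a list of [start, end] pairs, with end exclusive."""
    clusters = defaultdict(list)
    idx = 0  # index of the first element of the current run
    for value, run in groupby(xs):
        length = sum(1 for _ in run)  # consume the run to count its elements
        clusters[value].append([idx, idx + length])
        idx += length
    return clusters

arr = [1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 3, 3, 3]
result = find_clusters_groupby(arr)
```

Because groupby yields nothing for an empty input, this version handles an empty list without a special case.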
