Merging consecutive items in a list if they occur more than once python

Question

I am looking to create an algorithm that can merge consecutive items in a list if they occur multiple times throughout it. Would appreciate seeing any approaches to this!

The input is a list with each item being its own character, and output is also a list.

Here's an example to clarify:

Let's say my string is "hello yellow".

We will convert it to a list.

i.e. ['h', 'e', 'l', 'l', 'o', ' ', 'y', 'e', 'l', 'l', 'o', 'w']

Then, we want to see which consecutive items occur more than once. Starting from the left, ['e', 'l'] occurs more than once.

We merge them to be 1 item, instead of 2 in the list.

['h', 'el', 'l', 'o', ' ', 'y', 'el', 'l', 'o', 'w']

Now we see that ['el', 'l'] occurs more than once, so we merge them together, like so:

['h', 'ell', 'o', ' ', 'y', 'ell', 'o', 'w']

Now, we merge 'ell', 'o' together since they occur more than once.

['h', 'ello', ' ', 'y', 'ello', 'w']

This is the final output: ['h', 'ello', ' ', 'y', 'ello', 'w']

I want to be able to do this for any input.

Like, another example of an input would be the list:

['h', 'e', 'l', 'l', 'o', ' ', 'h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']

The output would be:

['hello ', 'hello ', 'w', 'o', 'r', 'l', 'd']

I tried the following:

s = "hello there"

def merge_items(s):
  d = {}
  for i in range(0, len(s)):
      k = s[i:i+2]
      d[k] = d.setdefault(k, 0) + 1
  print(d)

  l = []
  for i in range(0, len(s)):
      k = s[i:i + 2]
      if d[k] > 1:
        l.append(k)
      else:
        l.extend(s[i])
  return l
  
print(merge_items(s))

The 'e' is printed twice here, and it doesn't work for other inputs, such as "hello hello". I'm having trouble expanding it to strings that have more than 2 characters repeating.

I'm not sure how to improve this, as I am very new to Python.

Output:

{'he': 2, 'el': 1, 'll': 1, 'lo': 1, 'o ': 1, ' t': 1, 'th': 1, 'er': 1, 're': 1, 'e': 1}
['he', 'e', 'l', 'l', 'o', ' ', 't', 'he', 'e', 'r', 'e']

If I were to input "hello hello world" as the string, the output is this:

{'he': 2, 'el': 2, 'll': 2, 'lo': 2, 'o ': 2, ' h': 1, ' w': 1, 'wo': 1, 'or': 1, 'rl': 1, 'ld': 1, 'd': 1}
['he', 'el', 'll', 'lo', 'o ', ' ', 'he', 'el', 'll', 'lo', 'o ', ' ', 'w', 'o', 'r', 'l', 'd']

Right now, I am counting pairs, but am unsure about how to merge "hello" into one item.

Answer 1

If you're trying to do multiple substitutions at once, it's hard to keep them from colliding with each other. Easier to just do one substitution per pass and iterate until there's nothing left to do:

from collections import Counter
from typing import Sequence


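def merge_items(s: Sequence[str]) -> Sequence[str]:
    def merge_once(s: Sequence[str]) -> Sequence[str]:
        if len(s) <= 1:
            return s
        # Adjacent pairs, plus a sentinel pair so the last item is always emitted.
        ss = list(zip(s, s[1:])) + [(s[-1], '')]
        c = Counter(ss)
        try:
            # First pair (in order of appearance) that occurs more than once.
            a, b = next(p for p, count in c.items() if count > 1)
        except StopIteration:
            return s
        i = 0
        ret = []
        while i < len(ss):
            x, y = ss[i]
            # Compare the pair itself rather than x + y, so that differently
            # split items such as ('a', 'bc') and ('ab', 'c') are not confused.
            if (x, y) == (a, b):
                ret.append(x + y)
                i += 2
            else:
                ret.append(x)
                i += 1
        return ret
    # Repeat one substitution per pass until a pass changes nothing.
    t = merge_once(s)
    while t != s:
        s = t
        t = merge_once(s)
    return s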


print(merge_items("hello there"))  
# ['he', 'l', 'l', 'o', ' ', 't', 'he', 'r', 'e']

print(merge_items("hello hello world"))  
# ['hello ', 'hello ', 'w', 'o', 'r', 'l', 'd']
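
To make the "one substitution per pass" idea easier to follow, here is a minimal sketch that prints the list after every pass. It is a separate illustration, not the code above: the helper names `merge_first_repeated_pair` and `merge_passes` are made up for this example, and it assumes the input is already a list of strings, as in the question.

```python
from collections import Counter


def merge_first_repeated_pair(items):
    """Merge the leftmost adjacent pair that occurs more than once; return a new list."""
    pairs = list(zip(items, items[1:]))
    counts = Counter(pairs)
    # Scanning the pairs left to right gives the leftmost repeated pair.
    target = next((p for p in pairs if counts[p] > 1), None)
    if target is None:
        return list(items)
    merged, i = [], 0
    while i < len(items):
        if i + 1 < len(items) and (items[i], items[i + 1]) == target:
            merged.append(items[i] + items[i + 1])
            i += 2  # both items of the pair are consumed
        else:
            merged.append(items[i])
            i += 1
    return merged


def merge_passes(items):
    """Yield the list after every pass until a pass changes nothing."""
    current = list(items)
    while True:
        nxt = merge_first_repeated_pair(current)
        if nxt == current:
            return
        yield nxt
        current = nxt


for step in merge_passes(list("hello yellow")):
    print(step)
```

For "hello yellow" this should print the same three intermediate lists as the walkthrough in the question, ending with ['h', 'ello', ' ', 'y', 'ello', 'w'].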

Answer 2

This is a dynamic programming-ish approach: the idea is to represent the string as a matrix that compares every pair of characters. For example, 'hello yellow' translates into the matrix below. The sequence we're after is now identifiable as the longest diagonal run of 1s (excluding the main diagonal, of course), which gives us 'ello'.

#  h  e  l  l  o     y  e  l  l  o  w
h [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
e [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
l [0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0]
l [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0]
o [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
e [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
l [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
l [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
o [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
w [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
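
If you want to reproduce the grid yourself, a small sketch along these lines should do it. The function name `match_matrix` is only an illustration; it fills the upper half of the matrix and leaves the main diagonal and the lower half at 0, as in the picture above.

```python
def match_matrix(s):
    """Return an n x n grid with 1 where s[row] == s[col] and col > row, else 0."""
    n = len(s)
    return [
        [1 if col > row and s[row] == s[col] else 0 for col in range(n)]
        for row in range(n)
    ]


s = "hello yellow"
grid = match_matrix(s)

# Print a header row followed by one labelled row per character.
print("#  " + "  ".join(s))
for ch, row in zip(s, grid):
    print(ch, row)
```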

In fact, we don't even need the full grid. It's enough to collect the coordinates with identical characters into a set and get the longest subsequence afterwards, which is more or less straightforward.

def get_pairs(s):
    """
    Input: a string or list of characters
    Output: a set {(x, y), (x2, y2), ...} where s[x] == s[y]
    """
    n = len(s)

    pairs = set()

    # This iterates through the upper half of the imaginary matrix
    for y in range(n):
        for x in range(y+1, n):
            if s[x] == s[y]:
                pairs.add((x,y)) # collect matching pairs

    return pairs


# There's some potential for optimization here, but you get the idea.
def longest_subsequence(pairs):
    """
    Input: A sequence of coordinates [(x, y), (x2, y2), ...] 
    Output: The longest subsequence where [(x, y), (x+1, y+1), (x+2, y+2)] applies
    """
    longest = []

    for p in pairs:
        seq = [p]
        x, y = p

        # keep collecting items on the diagonal
        while True:
            x, y = x+1, y+1

            if (x,y) in pairs:
                seq.append((x,y))
            else:
                break
                
        if len(seq) > len(longest):
            longest = seq
    
    return longest

Running some tests, this seems to work as expected, with one issue: if the string consists of a single repeated character, the result is counter-intuitive because the match overlaps with itself. In general, overlapping matches seem to be a problem, but you didn't say much about edge cases.

test = ['hello yellow',
        'hello hello world',
        'aaaaabaaaaa',
        'aaaaaaaaaa' # this is problematic
       ]

for s in test:
    
    pairs = get_pairs(s)
    result = longest_subsequence(pairs)
    
    print(f'{s=}')
    print(f'{result=}')
    print(repr(''.join(s[i] for i,_ in result)))
    
    print()

Results:

s='hello yellow'
result=[(7, 1), (8, 2), (9, 3), (10, 4)]
'ello'

s='hello hello world'
result=[(6, 0), (7, 1), (8, 2), (9, 3), (10, 4), (11, 5)]
'hello '

s='aaaaabaaaaa'
result=[(6, 0), (7, 1), (8, 2), (9, 3), (10, 4)]
'aaaaa'

s='aaaaaaaaaa'
result=[(1, 0), (2, 1), (3, 2), (4, 3), (5, 4), (6, 5), (7, 6), (8, 7), (9, 8)]
'aaaaaaaaa'
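
One possible way to deal with the overlapping case, assuming that the two occurrences are not allowed to share any index (the question doesn't say, so treat this as a guess at the intended behaviour), is to stop extending a diagonal once its length reaches the distance between the two occurrences. The name `longest_nonoverlapping` is made up for this sketch, and `get_pairs` is repeated in compact form only so the snippet runs on its own.

```python
def get_pairs(s):
    """Compact restatement of get_pairs: all (x, y) with x > y and s[x] == s[y]."""
    n = len(s)
    return {(x, y) for y in range(n) for x in range(y + 1, n) if s[x] == s[y]}


def longest_nonoverlapping(pairs):
    """Longest diagonal run whose two occurrences do not share any index."""
    best = []
    for x, y in pairs:
        gap = x - y  # distance between the start of the two occurrences
        seq = [(x, y)]
        # Stop before the earlier occurrence would run into the later one.
        while len(seq) < gap and (x + 1, y + 1) in pairs:
            x, y = x + 1, y + 1
            seq.append((x, y))
        if len(seq) > len(best):
            best = seq
    return best


for s in ['hello yellow', 'aaaaabaaaaa', 'aaaaaaaaaa']:
    run = longest_nonoverlapping(get_pairs(s))
    print(repr(s), '->', repr(''.join(s[i] for i, _ in run)))
```

Under that assumption, 'aaaaaaaaaa' yields 'aaaaa' (two halves of five) instead of nine overlapping characters, while the other test strings give the same substrings as before.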

Edit: Forgot about the generalization. The matrix contains enough information to collect multiple matches as well. To do so, we can replace the second function with the one below that collects all subsequences.

from collections import defaultdict

def find_all_subsequences(pairs):
    """
    Input: A sequence of coordinates [(x, y), (x2, y2), ...] 
    Output: A dictionary mapping substrings to sets of subsequences {str: {((x, y),), ...}}
    """
    
    def to_string(result):
        # Note: this reads the module-level variable `s` (the input string).
        return ''.join(s[i] for i,_ in result)
    
    output = defaultdict(set)
    seen = set() # to prevent double matches
    
    for p in pairs:
        
        if p in seen:
            continue
                
        seq = [p]
        x, y = p

        # collect diagonal sequences
        while True:
            x, y = x+1, y+1
            
            if (x,y) in pairs:
                seq.append((x,y))
                
                # keep track of visited elements
                seen.add((x, y))
                
                # add to the output
                output[to_string(seq)].add(tuple(seq))
            else:
                output[to_string(seq)].add(tuple(seq))
                break

    return output

This works similarly to the function above, but collects all the combinations, e.g.:

s = "hello yellow marshmellow, don't yell"

pairs = get_pairs(s)
result = find_all_subsequences(pairs)

for k in sorted(result, key=len, reverse=True):
    v = result[k]
    print(f"Sequence: {k!r}, length: {len(k)}, occurrences: {len(v)+1}")
    #print(v) # uncomment to see the raw data

This prints all of the subsequences, as shown below. You can filter them according to your needs.

Sequence: ' yell', length: 5, occurrences: 2
Sequence: 'ellow', length: 5, occurrences: 2
Sequence: ' yel', length: 4, occurrences: 2
Sequence: 'llow', length: 4, occurrences: 2
Sequence: 'ello', length: 4, occurrences: 4
Sequence: ' ye', length: 3, occurrences: 2
Sequence: 'llo', length: 3, occurrences: 2
Sequence: 'ell', length: 3, occurrences: 6
Sequence: ' y', length: 2, occurrences: 2
Sequence: 'll', length: 2, occurrences: 2
Sequence: 'el', length: 2, occurrences: 6
Sequence: 'lo', length: 2, occurrences: 2
Sequence: 'h', length: 1, occurrences: 2
Sequence: 'o', length: 1, occurrences: 4
Sequence: 'l', length: 1, occurrences: 17
Sequence: 'm', length: 1, occurrences: 2
Sequence: ' ', length: 1, occurrences: 6
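
As one example of such filtering, the sketch below keeps only substrings of a minimum length and orders them longest first. `filter_subsequences` is a hypothetical helper, and the `example` dictionary is hand-built to have the same shape as the output of `find_all_subsequences`; it is not actual output of that function.

```python
def filter_subsequences(result, min_length=2):
    """Keep only substrings at least min_length long, longest first."""
    kept = {k: v for k, v in result.items() if len(k) >= min_length}
    return dict(sorted(kept.items(), key=lambda kv: len(kv[0]), reverse=True))


# Hand-built stand-in shaped like {substring: {tuple of (x, y) pairs, ...}}.
example = {
    "l": {((3, 2),), ((9, 8),)},
    "ello": {((7, 1), (8, 2), (9, 3), (10, 4))},
    "ll": {((8, 2), (9, 3))},
}

for text, runs in filter_subsequences(example).items():
    print(f"Sequence: {text!r}, length: {len(text)}, runs: {len(runs)}")
```

Raising `min_length` drops the single-character matches, which are usually the least interesting ones.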
