longest common sequence group

Question

Given the following lines of text

TOKYO-BLING.1 H02-AVAILABLE
TOKYO-BLING.1 H02-MIDDLING
TOKYO-BLING.1 H02-TOP
TOKYO-BLING.2 H04-USED
TOKYO-BLING.2 H04-AVAILABLE
TOKYO-BLING.2 H04-CANCELLED
WAY-VERING.1 H03-TOP
WAY-VERING.2 H03-USED
WAY-VERING.2 H03-AVAILABLE
WAY-VERING.1 H03-CANCELLED

I would like to do some parsing to generate somewhat sensible groupings. The list above can be grouped as follows

TOKYO-BLING.1 H02-AVAILABLE
TOKYO-BLING.1 H02-MIDDLING
TOKYO-BLING.1 H02-TOP

TOKYO-BLING.2 H04-USED
TOKYO-BLING.2 H04-AVAILABLE
TOKYO-BLING.2 H04-CANCELLED

WAY-VERING.2 H03-USED
WAY-VERING.2 H03-AVAILABLE

WAY-VERING.1 H03-TOP
WAY-VERING.1 H03-CANCELLED

Can anyone suggest an algorithm (or some other method) that can scan through a given amount of text and work out that it can be grouped as above? Obviously each group can be grouped further. I guess I am looking for a good way to take a list of phrases and work out how best to group them by some common string sequence.

Answer 1

Here's one way:

Sort your entries
Determine the length of common prefix between each entry
Group your entries by separating the list at points where the common prefix is shorter than that of the previous entry

Example implementation:

def common_count(t0, t1):
  "returns the length of the longest common prefix"
  for i, pair in enumerate(zip(t0, t1)):
    if pair[0] != pair[1]:
      return i
  # no mismatch found: the shorter string is a prefix of the longer one
  return min(len(t0), len(t1))

def group_by_longest_prefix(iterable):
  "given a sorted list of strings, group by longest common prefix"
  longest = 0
  out = []

  for t in iterable:
    if out: # if there are previous entries 

      # determine length of prefix in common with previous line
      common = common_count(t, out[-1])

      # if the current entry has a shorter prefix, output previous
      # entries as a group then start a new group
      if common < longest:
        yield out
        longest = 0
        out = []
      # otherwise, just update the target prefix length
      else:
        longest = common

    # add the current entry to the group
    out.append(t)

  # return remaining entries as the last group
  if out:
    yield out

Example usage:

text = """
TOKYO-BLING.1 H02-AVAILABLE
TOKYO-BLING.1 H02-MIDDLING
TOKYO-BLING.1 H02-TOP
TOKYO-BLING.2 H04-USED
TOKYO-BLING.2 H04-AVAILABLE
TOKYO-BLING.2 H04-CANCELLED
WAY-VERING.1 H03-TOP
WAY-VERING.2 H03-USED
WAY-VERING.2 H03-AVAILABLE
WAY-VERING.1 H03-CANCELLED
"""

T = sorted(t.strip() for t in text.split("\n") if t)

for L in group_by_longest_prefix(T):
  print(L)

This produces:

['TOKYO-BLING.1 H02-AVAILABLE', 'TOKYO-BLING.1 H02-MIDDLING', 'TOKYO-BLING.1 H02-TOP']
['TOKYO-BLING.2 H04-AVAILABLE', 'TOKYO-BLING.2 H04-CANCELLED', 'TOKYO-BLING.2 H04-USED']
['WAY-VERING.1 H03-CANCELLED', 'WAY-VERING.1 H03-TOP']
['WAY-VERING.2 H03-AVAILABLE', 'WAY-VERING.2 H03-USED']

See it in action here: http://ideone.com/1Da0S

Answer 2

You could split each string on whitespace and then build a dict keyed on the first field.

This is how I did it:

with open('hotels.txt') as f:            # read the data; the file is closed afterwards
    lines = [i.strip() for i in f]       # take off the newlines
h = [i.split() for i in lines if i]      # split on whitespace, skipping blank lines
                                         # now h is a list of lists of strings

keys = [ i[0] for i in h ]      # keys = ['TOKYO-BLING.1','TOKYO-BLING.1',...]
keys = list( set( keys ) )      # take out redundant elements

d = dict()                      # start a dict
for i in keys:                  # initialize dict with empty lists
    d[i] = list()               # (one for each key)

for i in h:                     # for each list in h, append a suffix
    d[i[0]].append(i[1])        # to the appropriate prefix (or key)

This produces:

{'TOKYO-BLING.1': ['H02-AVAILABLE', 'H02-MIDDLING', 'H02-TOP'],
 'TOKYO-BLING.2': ['H04-USED', 'H04-AVAILABLE', 'H04-CANCELLED'],
 'WAY-VERING.1': ['H03-TOP', 'H03-CANCELLED'],
 'WAY-VERING.2': ['H03-USED', 'H03-AVAILABLE']}
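
A more compact variant of the same idea, as a minimal sketch: collections.defaultdict creates the empty list for a key the first time it is used, so the separate pass that collects the keys is not needed. The name group_by_first_field is made up for illustration, and the sketch assumes the first whitespace-separated field is the grouping key.

```python
from collections import defaultdict

def group_by_first_field(lines):
    """Group lines on their first whitespace-separated field (hypothetical helper)."""
    groups = defaultdict(list)
    for line in lines:
        parts = line.strip().split(None, 1)   # split once, on any whitespace
        if not parts:
            continue                          # skip blank lines
        key = parts[0]
        rest = parts[1] if len(parts) > 1 else ''
        groups[key].append(rest)
    return dict(groups)

sample = [
    "TOKYO-BLING.1 H02-AVAILABLE",
    "TOKYO-BLING.1 H02-MIDDLING",
    "WAY-VERING.2 H03-USED",
]
print(group_by_first_field(sample))
```

Like the version above, this only works when the grouping key is a whole field; it does not discover the prefix by itself.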

Answer 3

A generalized suffix tree or a suffix array would work.
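
As a rough illustration of the suffix-array idea (a naive sketch, not an efficient construction): collect every suffix of every string, sort them, and compare neighbours that come from different strings. The longest prefix shared by such a pair is the longest substring common to two of the inputs. The function names below are invented for this example, and building every suffix explicitly is only reasonable for small inputs.

```python
def shared_prefix_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def longest_shared_substring(strings):
    """Longest substring occurring in at least two different strings."""
    # Naive generalized suffix array: (suffix, index of source string), sorted.
    suffixes = sorted(
        (s[i:], idx) for idx, s in enumerate(strings) for i in range(len(s))
    )
    best = ""
    # Only adjacent entries need comparing: sorting puts the suffixes that
    # share the longest prefixes next to each other.
    for (s0, i0), (s1, i1) in zip(suffixes, suffixes[1:]):
        if i0 == i1:
            continue  # both suffixes come from the same string
        n = shared_prefix_len(s0, s1)
        if n > len(best):
            best = s0[:n]
    return best

print(longest_shared_substring(["xabcdy", "zzabcdw"]))  # abcd
```

Unlike the prefix-based answers, this finds shared text anywhere in the strings, which matters if the common part is not at the start of each line.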

Answer 4

Here is mine, it started out shorter:

import os

def prefix_groups(data):
    """Return a dictionary of {prefix:[items]}."""
    lines = data[:]
    groups = dict()
    while lines:
        longest = None
        # take items from the front so each group keeps its input order
        first = lines.pop(0)
        for line in lines:
            prefix = os.path.commonprefix([first, line])
            if not longest or len(prefix) > len(longest):
                longest = prefix
        if longest:
            group = [first]
            # match on a true prefix, not a substring anywhere in the item
            rest = [item for item in lines if item.startswith(longest)]
            lines = [item for item in lines if not item.startswith(longest)]
            group.extend(rest)
            groups[longest] = group
        else:
            # Singletons raise an exception
            raise IndexError("No prefix match for {}!".format(first))
    return groups

if __name__ == "__main__":
    from pprint import pprint
    data = """
    TOKYO-BLING.1 H02-AVAILABLE
    TOKYO-BLING.1 H02-MIDDLING
    TOKYO-BLING.1 H02-TOP
    TOKYO-BLING.2 H04-USED
    TOKYO-BLING.2 H04-AVAILABLE
    TOKYO-BLING.2 H04-CANCELLED
    WAY-VERING.1 H03-TOP
    WAY-VERING.2 H03-USED
    WAY-VERING.2 H03-AVAILABLE
    WAY-VERING.1 H03-CANCELLED
    """
    data = [line.strip() for line in data.split('\n') if line.strip()]
    groups = prefix_groups(data)
    pprint(groups)

Output:

{'TOKYO-BLING.1 H02-': ['TOKYO-BLING.1 H02-AVAILABLE',
                        'TOKYO-BLING.1 H02-MIDDLING',
                        'TOKYO-BLING.1 H02-TOP'],
 'TOKYO-BLING.2 H04-': ['TOKYO-BLING.2 H04-USED',
                        'TOKYO-BLING.2 H04-AVAILABLE',
                        'TOKYO-BLING.2 H04-CANCELLED'],
 'WAY-VERING.1 H03-': ['WAY-VERING.1 H03-TOP', 'WAY-VERING.1 H03-CANCELLED'],
 'WAY-VERING.2 H03-': ['WAY-VERING.2 H03-USED', 'WAY-VERING.2 H03-AVAILABLE']}

4 answers

solution1
3 ACCEPTED 2012-06-29 14:37:22

solution2
1 2012-06-29 14:00:21

solution3
0 2012-07-04 01:45:44

solution4
0 2012-09-21 23:43:26
