
Longest common sequence group

Given the following lines of text:

TOKYO-BLING.1 H02-AVAILABLE
TOKYO-BLING.1 H02-MIDDLING
TOKYO-BLING.1 H02-TOP
TOKYO-BLING.2 H04-USED
TOKYO-BLING.2 H04-AVAILABLE
TOKYO-BLING.2 H04-CANCELLED
WAY-VERING.1 H03-TOP
WAY-VERING.2 H03-USED
WAY-VERING.2 H03-AVAILABLE
WAY-VERING.1 H03-CANCELLED

I would like to do some parsing to generate somewhat sensible groupings. The list above can be grouped as follows:

TOKYO-BLING.1 H02-AVAILABLE
TOKYO-BLING.1 H02-MIDDLING
TOKYO-BLING.1 H02-TOP

TOKYO-BLING.2 H04-USED
TOKYO-BLING.2 H04-AVAILABLE
TOKYO-BLING.2 H04-CANCELLED

WAY-VERING.2 H03-USED
WAY-VERING.2 H03-AVAILABLE

WAY-VERING.1 H03-TOP
WAY-VERING.1 H03-CANCELLED

Can anyone suggest an algorithm (or some method) that can scan through a given amount of text and work out that the text can be grouped as above? Obviously each group can be divided further. I guess I am looking for a good way to take a list of phrases and work out how best to group them by some common string sequence.

Here's one way:

  1. Sort your entries.
  2. Determine the length of the common prefix between each entry and the one before it.
  3. Group your entries by splitting the list at points where the common prefix is shorter than that of the previous entry.
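
To see where the split points fall, it can help to print the prefix length of each sorted entry against its predecessor. The following is a minimal sketch, not part of the original answer: the helper name `adjacent_prefix_lengths` is hypothetical, the four entries are a subset of the question's data, and `os.path.commonprefix` is used here simply as a character-wise prefix function.

```python
import os

# A small subset of the question's data, sorted as in step 1.
entries = sorted([
    "TOKYO-BLING.1 H02-TOP",
    "TOKYO-BLING.1 H02-AVAILABLE",
    "TOKYO-BLING.2 H04-USED",
    "TOKYO-BLING.2 H04-AVAILABLE",
])

def adjacent_prefix_lengths(sorted_entries):
    """Step 2: length of the common prefix of each entry with the one before it."""
    return [
        len(os.path.commonprefix([prev, cur]))
        for prev, cur in zip(sorted_entries, sorted_entries[1:])
    ]

lengths = adjacent_prefix_lengths(entries)

# Step 3: a drop in the prefix length marks the start of a new group.
for cur, n in zip(entries[1:], lengths):
    print(n, cur)
```

For these entries the lengths are 18, 12, 18: the drop to 12 is where "TOKYO-BLING.1" ends and "TOKYO-BLING.2" begins.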

Example implementation:

def common_count(t0, t1):
  "returns the length of the longest common prefix"
  for i, pair in enumerate(zip(t0, t1)):
    if pair[0] != pair[1]:
      return i
  # no mismatch found: the shorter string is a prefix of the longer one
  return min(len(t0), len(t1))

def group_by_longest_prefix(iterable):
  "given a sorted list of strings, group by longest common prefix"
  longest = 0
  out = []

  for t in iterable:
    if out: # if there are previous entries 

      # determine length of prefix in common with previous line
      common = common_count(t, out[-1])

      # if the current entry has a shorter prefix, output previous
      # entries as a group then start a new group
      if common < longest:
        yield out
        longest = 0
        out = []
      # otherwise, just update the target prefix length
      else:
        longest = common

    # add the current entry to the group
    out.append(t)

  # return remaining entries as the last group
  if out:
    yield out

Example usage:

text = """
TOKYO-BLING.1 H02-AVAILABLE
TOKYO-BLING.1 H02-MIDDLING
TOKYO-BLING.1 H02-TOP
TOKYO-BLING.2 H04-USED
TOKYO-BLING.2 H04-AVAILABLE
TOKYO-BLING.2 H04-CANCELLED
WAY-VERING.1 H03-TOP
WAY-VERING.2 H03-USED
WAY-VERING.2 H03-AVAILABLE
WAY-VERING.1 H03-CANCELLED
"""

T = sorted(t.strip() for t in text.split("\n") if t)

for L in group_by_longest_prefix(T):
  print(L)

This produces:

['TOKYO-BLING.1 H02-AVAILABLE', 'TOKYO-BLING.1 H02-MIDDLING', 'TOKYO-BLING.1 H02-TOP']
['TOKYO-BLING.2 H04-AVAILABLE', 'TOKYO-BLING.2 H04-CANCELLED', 'TOKYO-BLING.2 H04-USED']
['WAY-VERING.1 H03-CANCELLED', 'WAY-VERING.1 H03-TOP']
['WAY-VERING.2 H03-AVAILABLE', 'WAY-VERING.2 H03-USED']

See it in action here: http://ideone.com/1Da0S
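
If each group should also carry the prefix it was grouped on, one option is to compute it afterwards from the group's members. This is only a sketch under assumptions: the `groups` list below is hard-coded from two of the groups printed above, and `labelled` is a hypothetical name.

```python
import os

# Two of the groups printed above, hard-coded for illustration.
groups = [
    ["TOKYO-BLING.1 H02-AVAILABLE", "TOKYO-BLING.1 H02-MIDDLING", "TOKYO-BLING.1 H02-TOP"],
    ["WAY-VERING.1 H03-CANCELLED", "WAY-VERING.1 H03-TOP"],
]

# os.path.commonprefix compares character by character, so it works on
# arbitrary strings, not only on file paths.
labelled = {os.path.commonprefix(group): group for group in groups}

for prefix, members in labelled.items():
    print(repr(prefix), len(members))
```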

You could split each string on whitespace, and then build a dict.

This is how I did it:

with open( 'hotels.txt', 'r' ) as f:       # read the data; the file is closed afterwards
    lines = [ i.strip() for i in f ]       # one string per line, newlines taken off
h = [ i.split(' ') for i in lines if i ]   # split on the space, skipping blank lines
                                # now h is a list of lists of strings

keys = [ i[0] for i in h ]      # keys = ['TOKYO-BLING.1','TOKYO-BLING.1',...]
keys = list( set( keys ) )      # take out redundant elements

d = dict()                      # start a dict
for i in keys:                  # initialize dict with empty lists
    d[i] = list()               # (one for each key)

for i in h:                     # for each list in h, append a suffix
    d[i[0]].append(i[1])        # to the appropriate prefix (or key)

This produces:

{'TOKYO-BLING.1': ['H02-AVAILABLE', 'H02-MIDDLING', 'H02-TOP'],\
 'TOKYO-BLING.2': ['H04-USED', 'H04-AVAILABLE', 'H04-CANCELLED'],\
 'WAY-VERING.1': ['H03-TOP', 'H03-CANCELLED'],\
 'WAY-VERING.2': ['H03-USED', 'H03-AVAILABLE']}
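
The same idea can be written more compactly with `collections.defaultdict`, which removes the need to initialize the keys up front. This is a sketch rather than the answerer's code: the function name `group_by_first_field` is hypothetical, the input is an in-memory list instead of `hotels.txt`, and it assumes the key is everything before the first space.

```python
from collections import defaultdict

lines = [
    "TOKYO-BLING.1 H02-AVAILABLE",
    "TOKYO-BLING.1 H02-MIDDLING",
    "WAY-VERING.2 H03-USED",
    "WAY-VERING.1 H03-TOP",
    "WAY-VERING.2 H03-AVAILABLE",
]

def group_by_first_field(lines):
    """Map the text before the first space to the list of remainders."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, _, rest = line.partition(" ")
        groups[key].append(rest)
    return dict(groups)

grouped = group_by_first_field(lines)
print(grouped)
```

Note that this groups on an exact key match, so it only works when the grouping field is known in advance; it does not discover prefixes the way the sorted-prefix approach does.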

Here is mine, it started out shorter:

import os

def prefix_groups(data):
    """Return a dictionary of {prefix:[items]}."""
    lines = data[:]
    groups = dict()
    while lines:
        longest = None
        first = lines.pop(0)  # take entries in input order
        for line in lines:
            prefix = os.path.commonprefix([first, line])
            if not longest:
                longest = prefix
            elif len(prefix) > len(longest):
                longest = prefix
        if longest:
            group = [first]
            # match on the prefix itself, not on any substring occurrence
            rest = [item for item in lines if item.startswith(longest)]
            lines = [item for item in lines if not item.startswith(longest)]
            group.extend(rest)
            groups[longest] = group
        else:
            # Singletons raise an exception
            raise IndexError("No prefix match for {}!".format(first))
    return groups

if __name__ == "__main__":
    from pprint import pprint
    data = """
    TOKYO-BLING.1 H02-AVAILABLE
    TOKYO-BLING.1 H02-MIDDLING
    TOKYO-BLING.1 H02-TOP
    TOKYO-BLING.2 H04-USED
    TOKYO-BLING.2 H04-AVAILABLE
    TOKYO-BLING.2 H04-CANCELLED
    WAY-VERING.1 H03-TOP
    WAY-VERING.2 H03-USED
    WAY-VERING.2 H03-AVAILABLE
    WAY-VERING.1 H03-CANCELLED
    """
    data = [line.strip() for line in data.split('\n') if line.strip()]
    groups = prefix_groups(data)
    pprint(groups)

Output:

{'TOKYO-BLING.1 H02-': ['TOKYO-BLING.1 H02-AVAILABLE',
                        'TOKYO-BLING.1 H02-MIDDLING',
                        'TOKYO-BLING.1 H02-TOP'],
 'TOKYO-BLING.2 H04-': ['TOKYO-BLING.2 H04-USED',
                        'TOKYO-BLING.2 H04-AVAILABLE',
                        'TOKYO-BLING.2 H04-CANCELLED'],
 'WAY-VERING.1 H03-': ['WAY-VERING.1 H03-TOP', 'WAY-VERING.1 H03-CANCELLED'],
 'WAY-VERING.2 H03-': ['WAY-VERING.2 H03-USED', 'WAY-VERING.2 H03-AVAILABLE']}
