
Python string pattern recognition/compression

I can do basic regex all right, but this is slightly different: I don't know what the pattern is going to be.

For example, I have a list of similar strings:

lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']

In this case the common pattern is two segments of common text, 'sometxt' and 'moretxt', preceded and separated by something else that is variable in length.

The common strings and variable strings can of course occur in any order and on any number of occasions.

What would be a good way to condense/compress the list of strings into their common parts and individual variations?

An example output might be:

c = ['sometxt', 'moretxt']

v = [('a','0'), ('b','1'), ('aa','10'), ('zz','999')]

This solution finds the two longest common substrings and uses them to delimit the input strings:

def an_answer_to_stackoverflow_question_1914394(lst):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> an_answer_to_stackoverflow_question_1914394(lst)
    (['sometxt', 'moretxt'], [('a', '0'), ('b', '1'), ('aa', '10'), ('zz', '999')])
    """
    delimiters = find_delimiters(lst)
    return delimiters, list(split_strings(lst, delimiters))

find_delimiters and friends find the delimiters:

import itertools

def find_delimiters(lst):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> find_delimiters(lst)
    ['sometxt', 'moretxt']
    """
    # take the three longest common substrings; if the 2nd and 3rd tie in
    # length, the choice of a second delimiter is ambiguous
    candidates = list(itertools.islice(find_longest_common_substrings(lst), 3))
    if len(candidates) == 3 and len(candidates[1]) == len(candidates[2]):
        raise ValueError("Unable to find useful delimiters")
    # a second delimiter contained in the first would not split anything new
    if candidates[1] in candidates[0]:
        raise ValueError("Unable to find useful delimiters")
    return candidates[0:2]

def find_longest_common_substrings(lst):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> list(itertools.islice(find_longest_common_substrings(lst), 3))
    ['sometxt', 'moretxt', 'sometx']
    """
    for i in range(min_length(lst), 0, -1):  # try the longest lengths first
        for substring in common_substrings(lst, i):
            yield substring


def min_length(lst):
    return min(len(item) for item in lst)

def common_substrings(lst, length):
    """
    >>> list(common_substrings(["hello", "world"], 2))
    []
    >>> list(common_substrings(["aabbcc", "dbbrra"], 2))
    ['bb']
    """
    assert length <= min_length(lst)
    returned = set()
    for i, item in enumerate(lst):
        for substring in all_substrings(item, length):
            in_all_others = True
            for j, other_item in enumerate(lst):
                if j == i:
                    continue
                if substring not in other_item:
                    in_all_others = False
            if in_all_others:
                if substring not in returned:
                    returned.add(substring)
                    yield substring

def all_substrings(item, length):
    """
    >>> list(all_substrings("hello", 2))
    ['he', 'el', 'll', 'lo']
    """
    for i in range(len(item) - length + 1):
        yield item[i:i+length]

split_strings splits the strings using the delimiters:

import re

def split_strings(lst, delimiters):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> list(split_strings(lst, find_delimiters(lst)))
    [('a', '0'), ('b', '1'), ('aa', '10'), ('zz', '999')]
    """
    for item in lst:
        # escape the delimiters in case they contain regex metacharacters
        parts = re.split("|".join(re.escape(d) for d in delimiters), item)
        yield tuple(part for part in parts if part != '')

Here is a scary one to get the ball rolling.

>>> import re
>>> makere = lambda n: ''.join(['(.*?)(.+)(.*?)(.+)(.*?)'] + ['(.*)(\\2)(.*)(\\4)(.*)'] * (n - 1))
>>> inp = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
>>> re.match(makere(len(inp)), ''.join(inp)).groups()
('a', 'sometxt', '0', 'moretxt', '', 'b', 'sometxt', '1', 'moretxt', 'aa', '', 'sometxt', '10', 'moretxt', 'zz', '', 'sometxt', '999', 'moretxt', '')

I hope its sheer ugliness will inspire better solutions :)

This seems to be an example of the longest common subsequence problem. One way could be to look at how diffs are generated. The Hunt-McIlroy algorithm seems to have been the first, and, as such, the simplest, especially since it apparently is non-heuristic.

The first link contains detailed discussion and (pseudo) code examples. Assuming, of course, I'm not completely off the track here.
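In Python, difflib.SequenceMatcher implements a matcher in this family and is an easy way to experiment with the idea; a minimal sketch, assuming pairwise comparison of the strings is enough:

import difflib

def common_blocks(a, b):
    """
    >>> list(common_blocks('asometxt0moretxt', 'zzsometxt999moretxt'))
    ['sometxt', 'moretxt']
    """
    matcher = difflib.SequenceMatcher(None, a, b)
    for match in matcher.get_matching_blocks():
        if match.size:  # the final block is a zero-length sentinel
            yield a[match.a:match.a + match.size]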

I guess you should start by identifying substrings (patterns) that frequently occur in the strings. Since naively counting substrings in a set of strings is rather computationally expensive, you'll need to come up with something smart.

I've done substring counting on a large amount of data using generalized suffix trees (example here). Once you know the most frequent substrings/patterns in the data, you can take it from there.
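At the scale of the question's list, though, even the naive count is instant; a rough sketch using collections.Counter, counting each substring at most once per string (the suffix-tree machinery only pays off on large data):

from collections import Counter

def substring_counts(strings, length):
    """
    Count how many of the input strings contain each substring of a given length.
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> sorted(s for s, n in substring_counts(lst, 7).items() if n == len(lst))
    ['moretxt', 'sometxt']
    """
    counts = Counter()
    for s in strings:
        # a set, so each string contributes at most once per substring
        counts.update({s[i:i + length] for i in range(len(s) - length + 1)})
    return counts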

This looks much like the LZW algorithm for data (text) compression. There should be Python implementations out there, which you may be able to adapt to your needs.

I assume you have no a priori knowledge of these substrings that repeat often.
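For reference, the compression side of textbook LZW is only a few lines of Python; a sketch of the general algorithm, not adapted to this problem:

def lzw_compress(text):
    """
    Textbook LZW: emit integer codes, learning longer sequences as we go.
    >>> lzw_compress('ababab')
    [97, 98, 256, 256]
    """
    dictionary = {chr(i): i for i in range(256)}  # seed with single characters
    next_code = 256
    current = ''
    codes = []
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate
        else:
            codes.append(dictionary[current])
            dictionary[candidate] = next_code  # remember the new sequence
            next_code += 1
            current = ch
    if current:
        codes.append(dictionary[current])
    return codes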

How about subbing out the known text, and then splitting?

import re
[re.sub('(sometxt|moretxt)', ',', x).split(',') for x in lst]
# results in
[['a', '0', ''], ['b', '1', ''], ['aa', '10', ''], ['zz', '999', '']]
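
The trailing empty strings appear because each input ends with a delimiter; filtering them out gives exactly the variations asked for:

[[s for s in re.sub('(sometxt|moretxt)', ',', x).split(',') if s] for x in lst]
# [['a', '0'], ['b', '1'], ['aa', '10'], ['zz', '999']]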
