
Find the best subset from a list of strings to match a given string

I have a string

s = "mouse"

and a list of strings

sub_strings = ["m", "o", "se", "e"]

I need to find the best and shortest subset of the list sub_strings that matches s. What is the best way to do this? The ideal result would be ["m", "o", "se"], since together they spell "mose".

You could try difflib:

import difflib
print(difflib.get_close_matches(target_word, list_of_possibles))

Unfortunately, that would not work for your example above. You can use the Levenshtein distance instead:

def levenshtein_distance(first, second):
    """Find the Levenshtein distance between two strings."""
    if len(first) > len(second):
        first, second = second, first
    if len(second) == 0:
        return len(first)
    first_length = len(first) + 1
    second_length = len(second) + 1
    # distance_matrix[i][j] is the distance between first[:i] and second[:j].
    distance_matrix = [[0] * second_length for _ in range(first_length)]
    for i in range(first_length):
        distance_matrix[i][0] = i
    for j in range(second_length):
        distance_matrix[0][j] = j
    for i in range(1, first_length):
        for j in range(1, second_length):
            deletion = distance_matrix[i - 1][j] + 1
            insertion = distance_matrix[i][j - 1] + 1
            substitution = distance_matrix[i - 1][j - 1]
            if first[i - 1] != second[j - 1]:
                substitution += 1
            distance_matrix[i][j] = min(insertion, deletion, substitution)
    return distance_matrix[first_length - 1][second_length - 1]

sub_strings = ["mo", "m,", "o", "se", "e"]
s = "mouse"
# Pick the candidate with the smallest distance to s (first one wins on ties).
print(min(sub_strings, key=lambda x: levenshtein_distance(x, s)))

This will always give you the "closest" word to your target (for some definition of closest).

The Levenshtein function is taken from: http://www.korokithakis.net/posts/finding-the-levenshtein-distance-in-python/

You can use a regular expression:

import re

def matches(s, sub_strings):
    sub_strings = sorted(sub_strings, key=len, reverse=True)
    pattern = '|'.join(re.escape(substr) for substr in sub_strings)
    return re.findall(pattern, s)

This is at least short and quick, but it will not necessarily find the best set of matches; it is too greedy. For example,

matches("bears", ["bea", "be", "ars"])

returns ["bea"], when it should return ["be", "ars"].


Explanation of the code:

  • The first line sorts the substrings by length, so that the longest strings appear at the start of the list. This makes sure that the regular expression prefers longer matches over shorter ones.

  • The second line creates a regular expression pattern consisting of all the substrings, separated by the | symbol, which means "or".

  • The third line just uses the re.findall function to find all matches of the pattern in the given string s.
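To see the behaviour described above, here is a small runnable sketch. It repeats the function unchanged and calls it on the example from the question and on the "bears" counterexample; nothing beyond the standard re module is assumed.

```python
import re

def matches(s, sub_strings):
    # Longest alternatives first, so the regex prefers longer matches.
    sub_strings = sorted(sub_strings, key=len, reverse=True)
    # re.escape keeps characters such as "." or "|" in the substrings literal.
    pattern = '|'.join(re.escape(substr) for substr in sub_strings)
    return re.findall(pattern, s)

# The question's example: the pattern is "se|m|o|e".
print(matches("mouse", ["m", "o", "se", "e"]))  # ['m', 'o', 'se']

# The greedy failure: "bea" is consumed first, so "ars" can no longer match.
print(matches("bears", ["bea", "be", "ars"]))   # ['bea']
```

The scan runs left to right and never backtracks over an earlier match, which is why choosing "bea" rules out the better cover ["be", "ars"].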

This solution is based on this answer by user Running Wild. It uses the acora package by Stefan Behnel to efficiently find all the matches of the substrings in the target using the Aho–Corasick algorithm, and then uses dynamic programming to find the answer.

import acora
import collections

def best_match(target, substrings):
    """
    Find the best way to cover the string `target` by non-overlapping
    matches with strings taken from `substrings`. Return the best
    match as a list of substrings in order. (The best match is one
    that covers the largest number of characters in `target`, and
    among all such matches, the one using the fewest substrings.)

    >>> best_match('mouse', ['mo', 'ou', 'us', 'se'])
    ['mo', 'us']
    >>> best_match('aaaaaaa', ['aa', 'aaa'])
    ['aaa', 'aa', 'aa']
    >>> best_match('abracadabra', ['bra', 'cad', 'dab'])
    ['bra', 'cad', 'bra']
    """
    # Find all occurrences of the substrings in target and store them
    # in a dictionary by their position.
    ac = acora.AcoraBuilder(*substrings).build()
    matches = collections.defaultdict(set)
    for match, pos in ac.finditer(target):
        matches[pos].add(match)

    n = len(target)
    # Array giving the best (score, list of matches) found so far, for
    # each initial substring of the target.
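    best = [(0, []) for _ in range(n + 1)]
    for i in range(n):
        bi = best[i]
        bj = best[i + 1]
        # Skip target[i]: carry best[i] forward if it scores higher, or ties
        # on score without using more substrings.
        if bi[0] > bj[0] or (bi[0] == bj[0] and len(bi[1]) <= len(bj[1])):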
            best[i + 1] = bi
        for m in matches[i]:
            j = i + len(m)
            bj = best[j]
            score = bi[0] + len(m)
            if score > bj[0] or score == bj[0] and len(bi[1]) < len(bj[1]):
                best[j] = (score, bi[1] + [m])
    return best[n][1]
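If installing acora is not an option, the same dynamic programme can be sketched in pure Python. This is only an illustration under stated assumptions: the name best_match_simple and the helper _key are hypothetical, and the Aho–Corasick search is swapped for a naive str.startswith scan, which is slower on long targets or large substring lists but has no dependencies.

```python
from collections import defaultdict

def _key(entry):
    # Rank a (score, parts) entry: more characters covered first,
    # then fewer substrings used.
    score, parts = entry
    return (score, -len(parts))

def best_match_simple(target, substrings):
    """Cover `target` with non-overlapping substrings (hypothetical helper)."""
    # Naive occurrence search, standing in for the Aho-Corasick step.
    matches = defaultdict(set)
    for pos in range(len(target)):
        for sub in substrings:
            if sub and target.startswith(sub, pos):
                matches[pos].add(sub)

    n = len(target)
    # best[i] = best (score, parts) found so far for the prefix target[:i].
    best = [(0, []) for _ in range(n + 1)]
    for i in range(n):
        bi = best[i]
        # Option 1: leave target[i] uncovered and carry best[i] forward.
        if _key(bi) >= _key(best[i + 1]):
            best[i + 1] = bi
        # Option 2: place a substring that starts at position i.
        for m in matches.get(i, ()):
            j = i + len(m)
            candidate = (bi[0] + len(m), bi[1] + [m])
            if _key(candidate) >= _key(best[j]):
                best[j] = candidate
    return best[n][1]

print(best_match_simple("mouse", ["m", "o", "se", "e"]))  # ['m', 'o', 'se']
```

Ties are resolved in favour of the later candidate, which reproduces the results listed in the docstring above; other tie-breaking rules would give different but equally good covers.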
