简体   繁体   中英

find best subset from list of strings to match a given string

I have a string

s = "mouse"

and a list of string

sub_strings = ["m", "o", "se", "e"]

I need to find out what is the best and shortest matching subset of sub_strings the list that matches s. What is the best way to do this? The ideal result would be ["m", "o", "se"] since together they spell mose

import difflib
print difflib.get_close_matches(target_word,list_of_possibles)

but unfortunately it would not work for your example above you can use Levenstein distance instead...

def levenshtein_distance(first, second):
    """Find the Levenshtein distance between two strings."""
    if len(first) > len(second):
        first, second = second, first
    if len(second) == 0:
        return len(first)
    first_length = len(first) + 1
    second_length = len(second) + 1
    distance_matrix = [[0] * second_length for x in range(first_length)]
    for i in range(first_length):
       distance_matrix[i][0] = i
    for j in range(second_length):
       distance_matrix[0][j]=j
    for i in xrange(1, first_length):
        for j in range(1, second_length):
            deletion = distance_matrix[i-1][j] + 1
            insertion = distance_matrix[i][j-1] + 1
            substitution = distance_matrix[i-1][j-1]
            if first[i-1] != second[j-1]:
                substitution += 1
            distance_matrix[i][j] = min(insertion, deletion, substitution)
    return distance_matrix[first_length-1][second_length-1]

sub_strings = ["mo", "m,", "o", "se", "e"]
s="mouse"
print sorted(sub_strings,key = lambda x:levenshtein_distance(x,s))[0]

this will always give you the "closest" word to your target(for some definition of closest)

levenshtein function stolen from :http://www.korokithakis.net/posts/finding-the-levenshtein-distance-in-python/

You can use a regular expression:

import re

def matches(s, sub_strings):
    sub_strings = sorted(sub_strings, key=len, reverse=True)
    pattern = '|'.join(re.escape(substr) for substr in sub_strings)
    return re.findall(pattern, s)

This is at least short and quick, but it will not necessarily find the best set of matches; it is too greedy. For example,

matches("bears", ["bea", "be", "ars"])

returns ["bea"] , when it should return ["be", "ars"] .


Explanation of the code:

  • The first line sorts the substrings by length, so that the longest strings appear at the start of the list. This makes sure that the regular expression will prefer longer matches over shorter ones.

  • The second line creates a regular expression pattern consisting of all the substrings, separated by the | symbol, which means “or”.

  • The third line just uses the re.findall function to find all matches of the pattern in the given string s .

This solution is based on this answer by user Running Wild . It uses the acora package by Stefan Behnel to efficiently find all the matches of the substrings in the target using the Aho–Corasick algorithm and then uses dynamic programming to find the answer.

import acora
import collections

def best_match(target, substrings):
    """
    Find the best way to cover the string `target` by non-overlapping
    matches with strings taken from `substrings`. Return the best
    match as a list of substrings in order. (The best match is one
    that covers the largest number of characters in `target`, and
    among all such matches, the one using the fewest substrings.)

    >>> best_match('mouse', ['mo', 'ou', 'us', 'se'])
    ['mo', 'us']
    >>> best_match('aaaaaaa', ['aa', 'aaa'])
    ['aaa', 'aa', 'aa']
    >>> best_match('abracadabra', ['bra', 'cad', 'dab'])
    ['bra', 'cad', 'bra']
    """
    # Find all occurrences of the substrings in target and store them
    # in a dictionary by their position.
    ac = acora.AcoraBuilder(*substrings).build()
    matches = collections.defaultdict(set)
    for match, pos in ac.finditer(target):
        matches[pos].add(match)

    n = len(target)
    # Array giving the best (score, list of matches) found so far, for
    # each initial substring of the target.
    best = [(0, []) for _ in xrange(n + 1)]
    for i in xrange(n):
        bi = best[i]
        bj = best[i + 1]
        if bi[0] > bj[0] or bi[0] == bj[0] and len(bi[1]) < bj[1]:
            best[i + 1] = bi
        for m in matches[i]:
            j = i + len(m)
            bj = best[j]
            score = bi[0] + len(m)
            if score > bj[0] or score == bj[0] and len(bi[1]) < len(bj[1]):
                best[j] = (score, bi[1] + [m])
    return best[n][1]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM