
Finding combinations of stems and endings

I have mappings of "stems" and "endings" (may not be the correct words) that look like so:

all_endings = {
 'birth': set(['place', 'day', 'mark']), 
 'snow': set(['plow', 'storm', 'flake', 'man']),
 'shoe': set(['lace', 'string', 'maker']),
 'lock': set(['down', 'up', 'smith']),
 'crack': set(['down', 'up',]),
 'arm': set(['chair']),
 'high': set(['chair']),
 'over': set(['charge']),
 'under': set(['charge']),
}

But much longer, of course. I also made the corresponding dictionary the other way around:

all_stems = {
 'chair': set(['high', 'arm']),
 'charge': set(['over', 'under']),
 'up': set(['lock', 'crack', 'vote']),
 'down': set(['lock', 'crack', 'fall']),
 'smith': set(['lock']),
 'place': set(['birth']),
 'day': set(['birth']),
 'mark': set(['birth']),
 'plow': set(['snow']),
 'storm': set(['snow']),
 'flake': set(['snow']),
 'man': set(['snow']),
 'lace': set(['shoe']),
 'string': set(['shoe']),
 'maker': set(['shoe']),
}

I've now tried to come up with an algorithm to find any match of two or more "stems" that match two or more "endings". Above, for example, it would match down and up with lock and crack, resulting in

lockdown
lockup
crackdown
crackup

But not including 'upvote', 'downfall' or 'locksmith' (and it's this that causes me the biggest problems). I get false positives like:

pancake
cupcake
cupboard

But I'm just going round in "loops" (pun intended), and I don't seem to get anywhere. I'd appreciate any kick in the right direction.

Confused and useless code so far, which you probably should just ignore:

findings = defaultdict(set)
for stem, endings in all_endings.items():
    # What stems have matching endings:
    for ending in endings:
        otherstems = all_stems[ending]
        if not otherstems:
            continue
        for otherstem in otherstems:
            # Find endings that also exist for other stems
            otherendings = all_endings[otherstem].intersection(endings)
            if otherendings:
                # Some kind of match
                findings[stem].add(otherstem)

# Go through this in order of what is the most stems that match:

MINMATCH = 2
for match in sorted(findings.values(), key=len, reverse=True):
    for this_stem in match:
        other_stems = set() # Stems that have endings in common with this_stem
        other_endings = set() # Endings this stem have in common with other stems
        this_endings = all_endings[this_stem]
        for this_ending in this_endings:
            for other_stem in all_stems[this_ending] - set([this_stem]):
                matching_endings = this_endings.intersection(all_endings[other_stem])
                if matching_endings:
                    other_endings.add(this_ending)
                    other_stems.add(other_stem)

        stem_matches = all_stems[other_endings.pop()]
        for other in other_endings:
            stem_matches = stem_matches.intersection(all_stems[other])

        if len(stem_matches) >= MINMATCH:
            for m in stem_matches:
                for e in all_endings[m]:
                    print(m+e)

It's not particularly pretty, but this is quite straightforward if you break your dictionary down into two lists, and use explicit indices:

all_stems = {
 'chair' : set(['high', 'arm']),
 'charge': set(['over', 'under']),
 'fall'  : set(['down', 'water', 'night']),
 'up'    : set(['lock', 'crack', 'vote']),
 'down'  : set(['lock', 'crack', 'fall']),
}

endings     = all_stems.keys()
stem_sets   = all_stems.values()

i = 0
for target_stem_set in stem_sets:
    i += 1
    j  = 0

    remaining_stems = stem_sets[i:]
    for remaining_stem_set in remaining_stems:
        j += 1
        # Stems shared by both endings (this is a set intersection, not a union)
        common = target_stem_set & remaining_stem_set
        if len(common) > 1:
            print "%d matches found" % len(common)
            for stem in common:
                print "%s%s" % (stem, endings[i-1])
                print "%s%s" % (stem, endings[j+i-1])

Output:

$ python stems_and_endings.py 
2 matches found
lockdown
lockup
crackdown
crackup

Basically all we're doing is iterating through each set in turn, and comparing it with every remaining set to see if they share two or more stems. We never have to try sets that fall earlier than the current set, because they've already been compared in a prior iteration. The rest (indexing, etc.) is just book-keeping.
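The pairwise comparison described above can also be sketched without the manual counters. This is only an illustration written for Python 3, not the code from the answer; the function name shared_stem_words and the min_shared parameter are made up for the example.

```python
def shared_stem_words(all_stems, min_shared=2):
    """Return compound words for every pair of endings that share enough stems."""
    endings = list(all_stems)  # fixed order, so slicing by index is stable
    words = set()
    for i, first in enumerate(endings):
        # Only look at endings after the current one; earlier pairs were
        # already compared in a previous iteration.
        for second in endings[i + 1:]:
            common = all_stems[first] & all_stems[second]
            if len(common) >= min_shared:
                for stem in common:
                    words.add(stem + first)
                    words.add(stem + second)
    return words


# Same data as in the answer above.
all_stems = {
    'chair': {'high', 'arm'},
    'charge': {'over', 'under'},
    'fall': {'down', 'water', 'night'},
    'up': {'lock', 'crack', 'vote'},
    'down': {'lock', 'crack', 'fall'},
}

result = shared_stem_words(all_stems)
print(sorted(result))
# ['crackdown', 'crackup', 'lockdown', 'lockup']
```

Collecting the words in a set rather than printing them directly avoids duplicates when the same word turns up for more than one pair of endings.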

I think the way to avoid those false positives is to remove candidates that have no words left in the intersection of stems, if that makes sense :(

Please have a look and please let me know if I am missing something.

#using all_stems and all_endings from the question

#this function is declared at the end of this answer
two_or_more_stem_combinations = get_stem_combinations(all_stems)
print "two_or_more_stem_combinations", two_or_more_stem_combinations
#this print shows ... [set(['lock', 'crack'])] 

for request in two_or_more_stem_combinations:
    #we filter the initial index to only look for sets or words in the request
    candidates = filter(lambda x: x[0] in request, all_endings.items())

    #intersection of the words for the request
    words = candidates[0][1]
    for c in  candidates[1:]:
        words=words.intersection(c[1])

    #it's handy to have it in a dict
    candidates = dict(candidates)

    #we need to remove those that do not contain 
    #any words after the intersection of stems of all the candidates
    candidates_to_remove = set()
    for c in candidates.items():
        if len(c[1].intersection(words)) == 0:
            candidates_to_remove.add(c[0])

    for key in candidates_to_remove:
        del candidates[key]

    #now we know what to combine
    for c in candidates.keys():
       print "combine", c , "with", words 

Output :

combine lock with set(['down', 'up'])

combine crack with set(['down', 'up'])

As you can see this solution doesn't contain those false positives.
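The step that removes the false positives is the intersection of the endings of all stems in a group. A minimal sketch of just that step, in Python 3, might look like the following; endings_shared_by is a hypothetical helper name and not part of the code above.

```python
def endings_shared_by(stem_group, all_endings):
    """Return the endings that every stem in stem_group can take."""
    # A stem missing from all_endings contributes an empty set, which
    # makes the whole intersection empty.
    ending_sets = [all_endings.get(stem, set()) for stem in stem_group]
    if not ending_sets:
        return set()
    return set.intersection(*ending_sets)


# Excerpt of the question's data.
all_endings = {
    'lock': {'down', 'up', 'smith'},
    'crack': {'down', 'up'},
}

group = ('lock', 'crack')
shared = endings_shared_by(group, all_endings)
words = sorted(stem + ending for stem in group for ending in shared)
print(words)
# ['crackdown', 'crackup', 'lockdown', 'lockup']
```

Because 'smith' only follows 'lock', it drops out of the intersection, so 'locksmith' is never generated.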

Edit: complexity

The complexity of this solution doesn't get worse than O(3n) in the worst case, not counting dictionary accesses. And for most executions the first filter narrows down the solution space quite a lot.

Edit: getting the stems

This function recursively explores the dictionary all_stems and finds the combinations of two or more endings that have two or more stems in common.

def get_stems_recursive(stems,partial,result,at_least=2):
    if len(partial) >= at_least:
        stem_intersect=all_stems[partial[0]]
        for x in partial[1:]:
           stem_intersect = stem_intersect.intersection(all_stems[x])
           if len(stem_intersect) < 2:
               return
        result.append(stem_intersect)

    for i in range(len(stems)):
        remaining = stems[i+1:]
        #pass at_least along so a non-default value survives the recursion
        get_stems_recursive(remaining,partial + [stems[i][0]],result,at_least)


def get_stem_combinations(all_stems,at_least=2):
    result = []
    #list() keeps the slicing in get_stems_recursive working on Python 3 too
    get_stems_recursive(list(all_stems.items()),list(),result,at_least)
    return result

two_or_more_stem_combinations = get_stem_combinations(all_stems)
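As an alternative formulation of the same search (a sketch in Python 3, not the function above), the shared stems can be narrowed one ending at a time, abandoning a branch as soon as too few stems remain. The names ending_groups, min_endings and min_stems are assumptions made for this example, and unlike the function above it also records which endings produced each stem set.

```python
def ending_groups(all_stems, min_endings=2, min_stems=2):
    """Find groups of endings that share at least min_stems stems."""
    endings = sorted(all_stems)
    found = []

    def extend(start, chosen, shared):
        if len(chosen) >= min_endings:
            found.append((tuple(chosen), frozenset(shared)))
        for i in range(start, len(endings)):
            ending = endings[i]
            if shared is None:
                narrowed = all_stems[ending]
            else:
                narrowed = shared & all_stems[ending]
            # Adding more endings can only shrink the shared stems,
            # so a branch that is already too small can be dropped.
            if len(narrowed) >= min_stems:
                extend(i + 1, chosen + [ending], narrowed)

    extend(0, [], None)
    return found


# Excerpt of the question's all_stems.
all_stems = {
    'chair': {'high', 'arm'},
    'charge': {'over', 'under'},
    'up': {'lock', 'crack', 'vote'},
    'down': {'lock', 'crack', 'fall'},
    'smith': {'lock'},
}

groups = ending_groups(all_stems)
print(groups)
```

For this excerpt the only group found is the pair ('down', 'up') with the shared stems 'lock' and 'crack'.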

== Edited answer: ==

Well, here's another iteration for your consideration with the mistakes I made the first time addressed. Actually the result is code that is even shorter and simpler. The doc for combinations says that "if the input elements are unique, there will be no repeat values in each combination", so it should only be forming and testing the minimum number of intersections. It also appears that determining endings_by_stems isn't necessary.

from itertools import combinations

MINMATCH = 2
print 'all words with at least', MINMATCH, 'endings in common:'
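# stems_by_endings is not defined in this snippet; it is assumed here to be
# the question's all_stems mapping (ending -> set of stems)
for (word0,word1) in combinations(stems_by_endings, 2):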
    ending_words0 = stems_by_endings[word0]
    ending_words1 = stems_by_endings[word1]
    common_endings = ending_words0 & ending_words1
    if len(common_endings) >= MINMATCH:
        for stem in common_endings:
            print ' ', stem+word0
            print ' ', stem+word1

# all words with at least 2 endings in common:
#   lockdown
#   lockup
#   falldown
#   fallup
#   crackdown
#   crackup

== Previous answer ==

I haven't attempted much optimizing, but here's a somewhat brute-force -- but short -- approach that first calculates 'ending_sets' for each stem word, and then finds all the stem words that have common ending_sets with at least the specified minimum number of common endings.

In the final phase it prints out all the possible combinations of the stem + ending words it has detected that meet the criteria. I tried to make all variable names as descriptive as possible to make it easy to follow. ;-) I've also left out the definitions of 'all_endings' and 'all_stems'.

from collections import defaultdict
from itertools import combinations

ending_sets = defaultdict(set)
for stem in all_stems:
    # create a set of all endings that have this as stem
    for ending in all_endings:
        if stem in all_endings[ending]:
            ending_sets[stem].add(ending)


MINMATCH = 2
print 'all words with at least', MINMATCH, 'endings in common:'
for (word0,word1) in combinations(ending_sets, 2):
    ending_words0 = ending_sets[word0]
    ending_words1 = ending_sets[word1]
    if len(ending_words0) >= MINMATCH and ending_words0 == ending_words1:
        for stem in ending_words0:
            print ' ', stem+word0
            print ' ', stem+word1

# output
# all words with at least 2 endings in common:
#   lockup
#   lockdown
#   crackup
#   crackdown
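The nested loop at the start of the code above effectively turns one mapping around. If only one of the two dictionaries is available, a single pass can build the other. This is a sketch in Python 3; the helper name invert is made up for the example.

```python
from collections import defaultdict


def invert(mapping):
    """Turn {key: set(values)} into {value: set(keys)}."""
    inverted = defaultdict(set)
    for key, values in mapping.items():
        for value in values:
            inverted[value].add(key)
    return dict(inverted)


# Excerpt of the question's all_endings (stem -> set of endings).
all_endings = {
    'lock': {'down', 'up', 'smith'},
    'crack': {'down', 'up'},
    'arm': {'chair'},
    'high': {'chair'},
}

stems_by_ending = invert(all_endings)
print(sorted(stems_by_ending['down']))
# ['crack', 'lock']
```

Inverting the result again gives back the original mapping, which is a quick way to check that the two dictionaries are consistent with each other.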

If you represent your stemming relationships as a square binary array (where 1 means "x can follow y", for instance, and all other elements are set to 0), what you are trying to do is equivalent to looking for "broken rectangles" filled with ones:

       ... lock  **0  crack  **1  ...

...    ...
down   ...  1     0     1     1
up     ...  1     1     1     1
...    ...

Here, lock, crack, and **1 (an example word) can be matched with down and up (but not with the word **0). The stemming relationships draw a 2x3 rectangle filled with ones.
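To make the array view concrete, here is a small sketch in Python 3 using plain nested lists. It only checks whether a chosen set of rows and columns forms a rectangle of ones; it does not search for such rectangles. The helper names build_matrix and is_filled_rectangle are assumptions made for this example.

```python
def build_matrix(all_endings, stems, endings):
    """Rows are endings, columns are stems; 1 means the ending can follow the stem."""
    return [
        [1 if ending in all_endings.get(stem, set()) else 0 for stem in stems]
        for ending in endings
    ]


def is_filled_rectangle(matrix, rows, cols):
    """True if every cell at the chosen rows and columns is 1."""
    return all(matrix[r][c] == 1 for r in rows for c in cols)


# Excerpt of the question's all_endings.
all_endings = {
    'lock': {'down', 'up', 'smith'},
    'shoe': {'lace', 'string', 'maker'},
    'crack': {'down', 'up'},
}
stems = ['lock', 'shoe', 'crack']
endings = ['down', 'up', 'smith', 'lace']

matrix = build_matrix(all_endings, stems, endings)
for ending, row in zip(endings, matrix):
    print(ending.ljust(6), row)

# Rows down/up with columns lock/crack: all ones, although the columns are
# not adjacent (the "broken" part of the rectangle).
print(is_filled_rectangle(matrix, rows=[0, 1], cols=[0, 2]))     # True
# Adding the smith row breaks it, because 'cracksmith' is not a word here.
print(is_filled_rectangle(matrix, rows=[0, 1, 2], cols=[0, 2]))  # False
```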

Hope this helps!
