
Match adjacent list elements with a list of tuples in Python

I have an ordered list of individual words from a document, like so:

words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow', ...]

I have a second list of tuples of significant bigrams/collocations, like so:

bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house'), ...]

I would like to iterate through the list of individual words and replace adjacent words with an underscore-separated bigram, ending up with a list like this:

words_fixed = ['apple_orange', 'boat', 'car', 'happy_day', 'cow', ...]

I've considered flattening words and bigrams into strings (" ".join(words), etc.) and then using regex to find and replace the adjacent words, but that seems horribly inefficient and unpythonic.

What's the best way to quickly match and combine adjacent list elements from a list of tuples?

Not as flashy as @inspectorG4dget:

words_fixed = []
last = None  # previous word, held back until we know whether it starts a bigram
for word in words:
    if (last, word) in bigrams:
        words_fixed.append("%s_%s" % (last, word))
        last = None
    else:
        if last is not None:
            words_fixed.append(last)
        last = word
if last is not None:
    words_fixed.append(last)
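
Because bigrams is a list, each (last, word) membership test scans it from the start. Below is a minimal sketch of the same loop with the pairs stored in a set, which should help when there are many bigrams. The function name merge_bigrams is made up for illustration, and the sketch assumes None never appears as a word.

```python
def merge_bigrams(words, bigrams):
    """Join adjacent words that form a known bigram with an underscore."""
    bigram_set = set(bigrams)   # average O(1) membership test
    fixed = []
    last = None                 # word held back until we see its successor
    for word in words:
        if last is not None and (last, word) in bigram_set:
            fixed.append("%s_%s" % (last, word))
            last = None         # both words are consumed
        else:
            if last is not None:
                fixed.append(last)
            last = word
    if last is not None:
        fixed.append(last)      # flush the final unpaired word
    return fixed


words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow']
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]
print(merge_bigrams(words, bigrams))
# ['apple_orange', 'boat', 'car', 'happy_day', 'cow']
```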

words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow', ...]
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house'), ...]

First, some optimization:

import collections
bigram_pairs = bigrams  # keep the original list of tuples before rebinding the name
bigrams = collections.defaultdict(set)
for w1, w2 in bigram_pairs:
    bigrams[w1].add(w2)
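
To make the shape of this lookup table concrete, here is a small sketch on the sample data. It builds the mapping under a separate, hypothetical name (followers), and is_bigram is likewise only an illustrative helper.

```python
import collections

bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]

# Map each first word to the set of words that may follow it.
followers = collections.defaultdict(set)
for w1, w2 in bigrams:
    followers[w1].add(w2)


def is_bigram(w1, w2):
    # .get avoids creating empty entries for words that never start a bigram
    return w2 in followers.get(w1, ())


print(is_bigram('apple', 'orange'))   # True
print(is_bigram('orange', 'apple'))   # False
```

Checking a pair is then one dictionary lookup plus one set lookup, regardless of how many bigrams there are.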

Now, onto the fun stuff:

words_fixed = []
# Pair every word with its successor; this collects only the matched bigrams.
for w1, w2 in zip(words, words[1:]):
    if w1 in bigrams and w2 in bigrams[w1]:
        words_fixed.append("%s_%s" % (w1, w2))

If you also want to keep the words that are not part of any bigram, alongside the merged bigrams, then this should do the trick:

words_fixed = []
i = 0
while i < len(words):
    # Merge words[i] with its successor when the pair is a known bigram.
    if i + 1 < len(words) and words[i] in bigrams and words[i + 1] in bigrams[words[i]]:
        words_fixed.append("%s_%s" % (words[i], words[i + 1]))
        i += 2  # skip the word that was just merged
    else:
        words_fixed.append(words[i])  # also covers the final word
        i += 1

words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow']
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]

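# Map each word to its partner, in both directions.
bigrams_dict = dict(item for item in bigrams)
bigrams_dict.update(item[::-1] for item in bigrams)

# Note: the lookup is per word and ignores adjacency, so both halves of a
# pair are rewritten: 'apple' -> 'apple_orange' and 'orange' -> 'orange_apple'.
words_fixed = ["{}_{}".format(word, bigrams_dict[word])
    if word in bigrams_dict else word
    for word in words]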

[edit] Another way to create the dictionary:

from itertools import chain
bigrams_rev = (reversed(x) for x in bigrams)
bigrams_dict = dict(chain(bigrams, bigrams_rev))

words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow', 'big']
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]
print('words   :', words)
print('bigrams :', bigrams)
print()

def zwii(words, bigrams):
    it = iter(words)
    dict_bigrams = dict(bigrams)
    for x in it:
        if x in dict_bigrams:
            try:
                y = next(it)
            except StopIteration:
                # x was the last word, so there is nothing to pair it with
                yield x
                return
            if dict_bigrams[x] == y:
                yield '-'.join((x, y))
            else:
                # y is emitted as-is; it is not re-checked as the start of a bigram
                yield x
                yield y
        else:
            yield x

print(list(zwii(words, bigrams)))

result

words   : ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow', 'big']
bigrams : [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]

['apple-orange', 'boat', 'car', 'happy-day', 'cow', 'big']
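
Two limits of the generator above are worth noting: dict(bigrams) keeps only one partner per first word, and a non-matching y is yielded without being tested as the start of the next bigram. Here is a rough sketch that addresses both, assuming no word in the list is None (None is used as the end marker). The name merge_stream and the sep parameter are hypothetical.

```python
def merge_stream(words, bigrams, sep='-'):
    """Lazily yield words, joining adjacent pairs found in bigrams."""
    pairs = set(bigrams)          # keeps every pair, even with a shared first word
    it = iter(words)
    x = next(it, None)            # current word; None marks the end
    while x is not None:
        y = next(it, None)        # look one word ahead
        if y is not None and (x, y) in pairs:
            yield sep.join((x, y))
            x = next(it, None)    # both words consumed, move on
        else:
            yield x
            x = y                 # y may still start the next bigram


words = ['big', 'apple', 'orange', 'apple', 'pie']
bigrams = [('apple', 'orange'), ('apple', 'pie'), ('big', 'house')]
print(list(merge_stream(words, bigrams)))
# ['big', 'apple-orange', 'apple-pie']
```

Passing sep='_' gives the underscore-separated form asked for in the question.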
