
Match adjacent list elements with a list of tuples in Python

I have an ordered list of individual words from a document, like so:

words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow', ...]

I have a second list of tuples of significant bigrams/collocations, like so:

bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house'), ...]

I would like to iterate through the list of individual words and replace adjacent words with an underscore-separated bigram, ending up with a list like this:

words_fixed = ['apple_orange', 'boat', 'car', 'happy_day', 'cow', ...]

I've considered flattening words and bigrams into strings (" ".join(words), etc.) and then using regex to find and replace the adjacent words, but that seems horribly inefficient and unpythonic.

What's the best way to quickly match and combine adjacent list elements from a list of tuples?

Not as flashy as @inspectorG4dget's answer:

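words_fixed = []
last = None  # previous word, held back until we know whether it starts a bigram
for word in words:
    if (last, word) in bigrams:
        words_fixed.append("%s_%s" % (last, word))
        last = None
    else:
        if last is not None:
            words_fixed.append(last)
        last = word
# Flush the final held-back word, if any.
if last is not None:
    words_fixed.append(last)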
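Because `(last, word) in bigrams` scans the whole list of tuples at every step, the same loop may run noticeably faster against a set when the bigram list is large. A minimal sketch of that idea, assuming the bigrams fit in memory as a set; the function name `merge_bigrams` is made up for illustration:

```python
def merge_bigrams(words, bigrams):
    """Join adjacent words that form a known bigram with an underscore."""
    bigram_set = set(bigrams)  # set membership is O(1) on average
    merged = []
    last = None  # word waiting to see whether it pairs with the next one
    for word in words:
        if last is not None and (last, word) in bigram_set:
            merged.append("%s_%s" % (last, word))
            last = None
        else:
            if last is not None:
                merged.append(last)
            last = word
    if last is not None:
        merged.append(last)
    return merged


words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow']
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]
print(merge_bigrams(words, bigrams))
# ['apple_orange', 'boat', 'car', 'happy_day', 'cow']
```

The logic is unchanged from the loop above; only the lookup container differs.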
words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow', ...]
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house'), ...]

First, some optimization:

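import collections
# Index the bigrams as: first word -> set of second words, for fast lookup.
bigram_pairs = bigrams  # keep the original list of tuples under a new name
bigrams = collections.defaultdict(set)
for w1, w2 in bigram_pairs:
    bigrams[w1].add(w2)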
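As a quick illustration of what such an index looks like and how it is queried, here is a small sketch on its own data. The extra pair `('big', 'dog')` and the names `followers` and `is_bigram` are made up for the example:

```python
from collections import defaultdict

pairs = [('apple', 'orange'), ('happy', 'day'), ('big', 'house'), ('big', 'dog')]

followers = defaultdict(set)  # first word -> set of words that may follow it
for first, second in pairs:
    followers[first].add(second)


def is_bigram(w1, w2):
    # .get() avoids creating empty entries for words that never start a bigram
    return w2 in followers.get(w1, ())


print(is_bigram('big', 'dog'))   # True
print(is_bigram('dog', 'big'))   # False
```

A first word can have several possible followers this way, which a plain `dict(pairs)` would not preserve.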

Now, on to the fun stuff:

words_fixed = []
# zip(words, words[1:]) pairs each word with its successor
# (Python 3; in Python 2, itertools.izip avoids building an intermediate list).
for w1, w2 in zip(words, words[1:]):
    if w1 in bigrams and w2 in bigrams[w1]:
        words_fixed.append("%s_%s" % (w1, w2))

If you also want to keep the words that are not part of any bigram, alongside the merged bigrams, this should do the trick:

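words_fixed = []
skip = False  # True when the current word was already merged into a bigram
# Pad the successor list with None so the last word is visited too.
for w1, w2 in zip(words, words[1:] + [None]):
    if skip:
        skip = False
        continue
    if w1 in bigrams and w2 in bigrams[w1]:
        words_fixed.append("%s_%s" % (w1, w2))
        skip = True  # w2 is consumed; do not emit it again
    else:
        words_fixed.append(w1)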
words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow']
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]

# Map each word of a bigram to its partner, in both directions.
bigrams_dict = dict(bigrams)
bigrams_dict.update(item[::-1] for item in bigrams)

words_fixed = ["{}_{}".format(word, bigrams_dict[word])
    if word in bigrams_dict else word
    for word in words]

[edit] Another way to create the dictionary:

from itertools import chain
bigrams_rev = (reversed(x) for x in bigrams)
bigrams_dict = dict(chain(bigrams, bigrams_rev))
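Note that the comprehension looks at each word on its own: any word found in bigrams_dict is rewritten whether or not its partner sits next to it, and both halves of a pair are rewritten (the sample input gives 'apple_orange' followed by 'orange_apple'). If adjacency matters, a minimal sketch along the following lines may help. `join_adjacent` is a hypothetical helper name, and the sketch assumes each first word has at most one partner, as `dict(bigrams)` implies:

```python
words = ['orange', 'apple', 'boat', 'apple', 'orange']
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]


def join_adjacent(words, bigrams):
    """Merge w1, w2 into 'w1_w2' only when they are neighbours, in that order."""
    partner = dict(bigrams)  # first word -> second word (one partner per first word)
    out = []
    i = 0
    while i < len(words):
        nxt = words[i + 1] if i + 1 < len(words) else None
        if nxt is not None and partner.get(words[i]) == nxt:
            out.append("{}_{}".format(words[i], nxt))
            i += 2  # both words were consumed
        else:
            out.append(words[i])
            i += 1
    return out


print(join_adjacent(words, bigrams))
# ['orange', 'apple', 'boat', 'apple_orange']
```

Only the forward mapping is used here, so 'orange' followed by 'apple' is left alone.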
words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow', 'big']
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]
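print('words   :', words)
print('bigrams :', bigrams)
print()
def zwii(words, bigrams):
    it = iter(words)
    dict_bigrams = dict(bigrams)  # first word -> second word
    for x in it:
        if x in dict_bigrams:
            try:
                # Pull the following word to compare it with x's partner.
                y = next(it)
            except StopIteration:
                # x was the last word, so there is nothing to pair it with.
                yield x
            else:
                if dict_bigrams[x] == y:
                    yield '-'.join((x, y))
                else:
                    yield x
                    yield y
        else:
            yield x

print(list(zwii(words, bigrams)))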

Result:

words   : ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow', 'big']
bigrams : [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]

['apple-orange', 'boat', 'car', 'happy-day', 'cow', 'big']
