Match adjacent list elements with a list of tuples in Python
I have an ordered list of individual words from a document, like so:
words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow', ...]
I have a second list of tuples of significant bigrams/collocations, like so:
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house'), ...]
I would like to iterate through the list of individual words and replace each pair of adjacent words that forms a bigram with a single underscore-separated token, ending up with a list like this:
words_fixed = ['apple_orange', 'boat', 'car', 'happy_day', 'cow', ...]
I've considered flattening words and bigrams into strings (" ".join(words), etc.) and then using regex to find and replace the adjacent words, but that seems horribly inefficient and unpythonic.
What's the best way to quickly match and combine adjacent list elements from a list of tuples?
Not as flashy as @inspectorG4dget:
words_fixed = []
last = None
for word in words:
    if (last, word) in bigrams:
        words_fixed.append("%s_%s" % (last, word))
        last = None
    else:
        if last:
            words_fixed.append(last)
        last = word
if last:
    words_fixed.append(last)
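The loop above tests `(last, word) in bigrams` against a list, which is a linear scan for every word. A minimal sketch of the same idea with the bigrams held in a set, so each lookup is constant time on average, might look like this. The function name `merge_bigrams` is hypothetical, and the `is not None` checks are an assumption added so that an empty-string word would not be dropped.

```python
def merge_bigrams(words, bigrams):
    """Join adjacent words that form a known bigram with an underscore."""
    bigram_set = set(bigrams)  # hashing makes each membership test O(1) on average
    merged = []
    last = None  # previous word that has not been emitted yet
    for word in words:
        if last is not None and (last, word) in bigram_set:
            merged.append("%s_%s" % (last, word))
            last = None  # both words are consumed
        else:
            if last is not None:
                merged.append(last)
            last = word
    if last is not None:
        merged.append(last)  # flush the final unpaired word
    return merged


words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow']
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]
print(merge_bigrams(words, bigrams))
# ['apple_orange', 'boat', 'car', 'happy_day', 'cow']
```

Because a merged pair resets `last` to `None`, overlapping bigrams are resolved greedily from left to right.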
words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow', ...]
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house'), ...]
First, some optimization:
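import collections
# Keep the original list of pairs under another name before rebinding `bigrams`;
# otherwise the loop would iterate over the new, empty dict.
bigram_pairs = bigrams
# Map each first word to the set of second words it can pair with.
bigrams = collections.defaultdict(set)
for w1, w2 in bigram_pairs:
    bigrams[w1].add(w2)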
Now, onto the fun stuff:
words_fixed = []
# Pair each word with its successor; zip() replaces the Python 2-only itertools.izip.
for w1, w2 in zip(words, words[1:]):
    if w1 in bigrams and w2 in bigrams[w1]:
        words_fixed.append("%s_%s" % (w1, w2))
If you also want to keep the words that are not part of any bigram, in addition to the merged bigrams, then this should do the trick:
words_fixed = []
i = 0
while i < len(words):
    w1 = words[i]
    w2 = words[i + 1] if i + 1 < len(words) else None
    if w1 in bigrams and w2 in bigrams[w1]:
        words_fixed.append("%s_%s" % (w1, w2))
        i += 2  # skip the second word so it is not emitted twice
    else:
        words_fixed.append(w1)  # also covers the last word in the list
        i += 1
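Putting the two steps of this answer together, a self-contained sketch could look like the following. The helper names `build_bigram_index` and `merge_adjacent` are hypothetical, and the `('happy', 'hour')` pair is made-up sample data, added only to show that one first word can have several partners in the index.

```python
from collections import defaultdict


def build_bigram_index(bigrams):
    """Map each first word to the set of second words it can pair with."""
    index = defaultdict(set)
    for first, second in bigrams:
        index[first].add(second)
    return index


def merge_adjacent(words, index):
    """Walk the list once, merging a word with its successor when they form a bigram."""
    merged = []
    i = 0
    while i < len(words):
        # .get() avoids creating empty entries in the defaultdict
        partners = index.get(words[i], ())
        if i + 1 < len(words) and words[i + 1] in partners:
            merged.append("%s_%s" % (words[i], words[i + 1]))
            i += 2  # skip the word that was just merged
        else:
            merged.append(words[i])
            i += 1
    return merged


words = ['happy', 'hour', 'boat', 'happy', 'day', 'cow']
bigrams = [('happy', 'day'), ('happy', 'hour'), ('big', 'house')]
print(merge_adjacent(words, build_bigram_index(bigrams)))
# ['happy_hour', 'boat', 'happy_day', 'cow']
```

Building the index costs one pass over the bigrams, after which each word in the document needs only one dictionary lookup and one set lookup.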
words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow']
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]
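# Map each word to its partner, in both directions.
bigrams_dict = dict(bigrams)
bigrams_dict.update(item[::-1] for item in bigrams)
# Replace every word that has a partner; adjacency in `words` is not checked,
# and both halves of a pair are replaced.
words_fixed = ["{}_{}".format(word, bigrams_dict[word])
               if word in bigrams_dict else word
               for word in words]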
[edit] Another way to create the dictionary:
from itertools import chain
bigrams_rev = (reversed(x) for x in bigrams)
bigrams_dict = dict(chain(bigrams, bigrams_rev))
words = ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow', 'big']
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]
print('words :', words)
print('bigrams :', bigrams)
print()
def zwii(words, bigrams):
    it = iter(words)
    dict_bigrams = dict(bigrams)
    for x in it:
        if x in dict_bigrams:
            try:
                # Look at the word that follows x.
                y = next(it)
            except StopIteration:
                # x was the last word, so there is nothing to pair it with.
                yield x
            else:
                if dict_bigrams[x] == y:
                    yield '-'.join((x, y))
                else:
                    yield x
                    yield y
        else:
            yield x
print(list(zwii(words, bigrams)))
Result:
words : ['apple', 'orange', 'boat', 'car', 'happy', 'day', 'cow', 'big']
bigrams : [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]
['apple-orange', 'boat', 'car', 'happy-day', 'cow', 'big']
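In the generator above, when the word after `x` does not complete a bigram, that word is yielded immediately and never gets a chance to start a bigram itself. A possible variation, sketched here under the hypothetical name `zwii_lookahead`, carries the lookahead word into the next iteration instead. The sample word list is an assumption chosen to trigger that case.

```python
def zwii_lookahead(words, bigrams):
    """Yield words, joining adjacent pairs found in bigrams with a hyphen."""
    dict_bigrams = dict(bigrams)  # like the original, keeps one partner per first word
    it = iter(words)
    missing = object()  # sentinel meaning "no more words"
    x = next(it, missing)
    while x is not missing:
        y = next(it, missing)
        if y is not missing and dict_bigrams.get(x) == y:
            yield '-'.join((x, y))
            x = next(it, missing)  # both words consumed, move on
        else:
            yield x
            x = y  # the lookahead word may start a bigram of its own


words = ['big', 'happy', 'day', 'cow', 'big']
bigrams = [('apple', 'orange'), ('happy', 'day'), ('big', 'house')]
print(list(zwii_lookahead(words, bigrams)))
# ['big', 'happy-day', 'cow', 'big']
```

Using `next(it, missing)` with a sentinel avoids the `try`/`except` around the lookahead while still handling a trailing unpaired word.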