Checking superset of list in given order

I have a list of tuples in the format (float, string), sorted in descending order by score.

print sent_scores
[(0.10507038451969995,'Deadly stampede in Shanghai - Emergency personnel help victims.'),
 (0.078586381821416265,'Deadly stampede in Shanghai - Police and medical staff help injured people after the stampede.'),
 (0.072031446647399661, '- Emergency personnel help victims.')]

If two sentences in the list share four consecutive words, I want to remove the tuple with the lower score from the list. The new list should also preserve the order.

The expected output for the list above:

[(0.10507038451969995,'Deadly stampede in Shanghai - Emergency personnel help victims.')]

This will certainly first involve tokenizing the words, which can be done with the code below:

from nltk.tokenize import TreebankWordTokenizer

def tokenize_words(text):
    tokens = TreebankWordTokenizer().tokenize(text)
    contractions = ["n't", "'ll", "'m","'s"]
    fix = []
    for i in range(len(tokens)):
        for c in contractions:
            if tokens[i] == c: fix.append(i)
    fix_offset = 0
    for fix_id in fix:
        idx = fix_id - 1 - fix_offset
        tokens[idx] = tokens[idx] + tokens[idx+1]
        del tokens[idx+1]
        fix_offset += 1
    return tokens
tokenized_sents=[tokenize_words(sentence) for score,sentence in sent_scores]

Earlier I tried converting the words of each sentence into groups of 4, each held in a set, and then using issuperset against the other sentences. But that doesn't check continuity.
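A small sketch of why the set-based check loses continuity, using made-up word lists (not taken from the data above). Sets ignore order, so two groups with the same words in a different order look identical:

```python
# Hypothetical word groups, chosen only to illustrate the problem.
first = ['Emergency', 'personnel', 'help', 'victims']
second = ['victims', 'help', 'Emergency', 'personnel']

# As sets, the two groups are indistinguishable...
same_as_sets = set(first).issuperset(set(second))
# ...even though the words do not appear in the same order.
same_in_order = (first == second)

print(same_as_sets)   # True
print(same_in_order)  # False
```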

I suggest taking sequences of 4 tokens in a row from your tokenized list, and making a set of those sequences. Using Python's itertools module, this can be done rather elegantly:

import itertools

my_list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
i1 = itertools.islice(my_list, 0, None)
i2 = itertools.islice(my_list, 1, None)
i3 = itertools.islice(my_list, 2, None)
i4 = itertools.islice(my_list, 3, None)
print zip(i1, i2, i3, i4)

Output of the above code (nicely formatted for you):

[('The', 'quick', 'brown', 'fox'),
 ('quick', 'brown', 'fox', 'jumps'),
 ('brown', 'fox', 'jumps', 'over'),
 ('fox', 'jumps', 'over', 'the'),
 ('jumps', 'over', 'the', 'lazy'),
 ('over', 'the', 'lazy', 'dog')]

Actually, even more elegant would be:

my_list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
iterators = [itertools.islice(my_list, x, None) for x in range(4)]
print zip(*iterators)

Same output as before.
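The same idea can be wrapped in a small helper so the group size is not hard-coded. This is only a sketch: the name consecutive_groups and the parameter n are made up for illustration, and it is written so that it also runs on Python 3, where zip returns an iterator and has to be wrapped in list.

```python
import itertools

def consecutive_groups(tokens, n=4):
    """Return a list of every run of n consecutive tokens, as tuples."""
    # One iterator per offset: the first starts at index 0, the second at 1, ...
    iterators = [itertools.islice(tokens, start, None) for start in range(n)]
    # zip stops as soon as the shortest iterator runs out.
    return list(zip(*iterators))

my_list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
groups = consecutive_groups(my_list)
print(groups[0])    # ('The', 'quick', 'brown', 'fox')
print(len(groups))  # 6
print(consecutive_groups(['too', 'short']))  # []
```

A list with fewer than n tokens simply produces no groups, so very short sentences can never match anything.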

Now that you have the runs of four consecutive tokens (as 4-tuples) for each list, you can put those tuples in a set, and check whether the same 4-tuple appears in two different sets:

my_list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
set1 = set(zip(*[itertools.islice(my_list, x, None) for x in range(4)]))

other_list = ['The', 'quick', 'red', 'fox', 'goes', 'home']
set2 = set(zip(*[itertools.islice(other_list, x, None) for x in range(4)]))

print set1.intersection(set2) # Empty set
if set1.intersection(set2):
    print "Found something in common"
else:
    print "Nothing in common"
# Prints "Nothing in common"

third_list = ['The', 'quick', 'brown', 'fox', 'goes', 'to', 'school']
set3 = set(zip(*[itertools.islice(third_list, x, None) for x in range(4)]))

print set1.intersection(set3) # Set containing ('The', 'quick', 'brown', 'fox')
if set1.intersection(set3):
    print "Found something in common"
else:
    print "Nothing in common"
# Prints "Found something in common"

NOTE: If you're using Python 3, replace all the print "Something" statements with print("Something"): in Python 3, print became a function rather than a statement. Also wrap zip(...) in list(...) before printing it, because zip returns an iterator in Python 3. But if you're using NLTK, I suspect you're using Python 2.

IMPORTANT NOTE: Any itertools.islice objects you create will iterate through their original list once, and then become "exhausted" (they've returned all their data, so putting them in a second for loop will produce nothing, and the for loop just won't do anything). If you want to iterate through the same list multiple times, create multiple iterators (as you see I did in my examples).
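A minimal illustration of that exhaustion, using a shortened list (the variable names here are only for the example):

```python
import itertools

my_list = ['The', 'quick', 'brown', 'fox']
it = itertools.islice(my_list, 1, None)

first_pass = list(it)   # consumes the iterator
second_pass = list(it)  # nothing left to return

print(first_pass)   # ['quick', 'brown', 'fox']
print(second_pass)  # []
```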

Update: Here's how to eliminate the lesser-scoring sentences. First, replace this line:

tokenized_sents=[tokenize_words(sentence) for score,sentence in sent_scores]

with:

tokenized_sents=[(score,tokenize_words(sentence)) for score,sentence in sent_scores]

Now what you have is a list of (score, sentence) tuples, where each sentence is a list of tokens. Then we'll construct a list called scores_and_sentences_and_sets that will be a list of (score, sentence, set_of_four_word_groups) tuples (where set_of_four_word_groups is a set of four-word slices like in the example above):

scores_and_sentences_and_sets = [(score, sentence, set(zip(*[itertools.islice(sentence, x, None) for x in range(4)]))) for score,sentence in tokenized_sents]

That one-liner may be a bit too clever, actually, so let's unpack it to be a bit more readable:

scores_and_sentences_and_sets = []
for score, sentence in tokenized_sents:
    set_of_four_word_groups = set(zip(*[itertools.islice(sentence, x, None) for x in range(4)]))
    score_sentence_and_sets_tuple = (score, sentence, set_of_four_word_groups)
    scores_and_sentences_and_sets.append(score_sentence_and_sets_tuple)

Go ahead and experiment with those two code snippets, and you'll find that they do exactly the same thing.

Okay, so now we have a list of (score, sentence, set_of_four_word_groups) tuples. So we'll go through the list in order, and build up a result list consisting of ONLY the sentences we want to keep. Since the list is already sorted in descending order, that makes things a little easier: at any point in the list, we only have to look at the items that have already been "accepted" to see if any of them is a duplicate. If any of the accepted items is a duplicate of the one we're looking at, we don't even need to compare the scores, because we know the accepted item came earlier in the list, and therefore its score must be at least as high.

So here's some code that should do what you want:

accepted_items = []
for current_tuple in scores_and_sentences_and_sets:
    score, sentence, set_of_four_words = current_tuple
    found = False
    for accepted_tuple in accepted_items:
        accepted_score, accepted_sentence, accepted_set = accepted_tuple
        if set_of_four_words.intersection(accepted_set):
            found = True
            break
    if not found:
        accepted_items.append(current_tuple)
print accepted_items # Prints a whole bunch of tuples
sentences_only = [sentence for score, sentence, word_set in accepted_items]
print sentences_only # Prints just the tokenized sentences (lists of tokens)
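Putting the pieces together, here is a sketch of the whole procedure as one function that returns the original (score, sentence) tuples rather than token lists. The names four_word_groups and remove_overlapping are made up for this example, and str.split stands in for tokenize_words so that the snippet runs without NLTK; pass tokenize=tokenize_words to use the tokenizer from the question. It is written so that it also runs on Python 3.

```python
import itertools

def four_word_groups(tokens, n=4):
    """Set of every run of n consecutive tokens, as tuples."""
    return set(zip(*[itertools.islice(tokens, x, None) for x in range(n)]))

def remove_overlapping(sent_scores, tokenize=str.split):
    """Keep each (score, sentence) unless it shares a four-word run
    with a sentence that was already kept.
    Assumes sent_scores is sorted by score in descending order."""
    accepted = []       # (score, sentence) tuples kept so far, in order
    accepted_sets = []  # four-word groups of each kept sentence
    for score, sentence in sent_scores:
        groups = four_word_groups(tokenize(sentence))
        if not any(groups & kept for kept in accepted_sets):
            accepted.append((score, sentence))
            accepted_sets.append(groups)
    return accepted

sent_scores = [
    (0.10507038451969995, 'Deadly stampede in Shanghai - Emergency personnel help victims.'),
    (0.078586381821416265, 'Deadly stampede in Shanghai - Police and medical staff help injured people after the stampede.'),
    (0.072031446647399661, '- Emergency personnel help victims.'),
]
result = remove_overlapping(sent_scores)
print(result)
```

With whitespace tokenization, the second and third sentences each share a four-word run with the first, so only the first tuple is kept, which matches the expected output in the question.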
