

Merge the first words in a list of word-pairs, depending on the second words in those pairs

I have a program (NLTK-NER) which provides me with this list:

    ('Barak', 'PERSON'),
    ('Obama', 'PERSON'),
    ('is', 'O'),
    ('the', 'O'),
    ('president', 'O'),
    ('of', 'O'),
    ('United', 'LOCATION'),
    ('States', 'LOCATION'),
    ('of', 'LOCATION'),
    ('America', 'LOCATION')

As you can see, "Barak" and "Obama" are words of type "PERSON", and I want to merge them (and likewise the words of type "LOCATION") together, like this:

['Barak Obama','is','the','president', 'of','United States of America']

How can I approach this problem?

What we're looking to do here, essentially, is group some items of classified_text together, so it stands to reason that itertools.groupby() can help. First of all, we need a key function that treats items with the tags 'PERSON' or 'LOCATION' as similar, and all other items as distinct.

This is slightly complicated by the fact that we need a way to distinguish adjacent items that have the same tag (other than 'PERSON' or 'LOCATION'), e.g. ('is', 'O'), ('the', 'O') etc. We can use enumerate() for that:

>>> list(enumerate(classified_text))
[..., (2, ('is', 'O')), (3, ('the', 'O')), (4, ('president', 'O')), ...]
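
To see why the index is needed, here is a small sketch that groups on the tag alone. The sample list is copied from the question; the name by_tag_only is only for illustration. Without the index, the consecutive 'O' words collapse into a single group, which is not what the question asks for:

```python
from itertools import groupby

classified_text = [
    ('Barak', 'PERSON'), ('Obama', 'PERSON'), ('is', 'O'), ('the', 'O'),
    ('president', 'O'), ('of', 'O'), ('United', 'LOCATION'),
    ('States', 'LOCATION'), ('of', 'LOCATION'), ('America', 'LOCATION'),
]

# Group on the tag alone (no index): adjacent 'O' items share a key,
# so groupby() puts them all in one group.
by_tag_only = [
    ' '.join(word for word, tag in group)
    for key, group in groupby(classified_text, key=lambda item: item[1])
]
print(by_tag_only)
# ['Barak Obama', 'is the president of', 'United States of America']
```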

Now that we know what we're going to provide as input to groupby(), we can write our key function:

def person_or_location(item):
    index, (word, tag) = item
    if tag in {'PERSON', 'LOCATION'}:
        return tag
    else:
        return index

Notice that the structure of index, (word, tag) in the assignment matches the structure of each item in our enumerated list.

Once we've got that, we can write another function to do the actual merging:

from itertools import groupby

def merge(tagged_text):
    enumerated_text = enumerate(tagged_text)
    grouped_text = groupby(enumerated_text, person_or_location)
    return [
        ' '.join(word for index, (word, tag) in group)
        for key, group in grouped_text
    ]

Here it is in action:

>>> merge(classified_text)
['Barak Obama', 'is', 'the', 'president', 'of', 'United States of America']
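
If the set of tags to merge needs to vary, one possible generalisation is to pass it in as an argument. This is only a sketch and not part of the approach above as written: the names merge_tags and mergeable are hypothetical, and the sample list is the one from the question.

```python
from itertools import groupby

def merge_tags(tagged_text, mergeable=frozenset({'PERSON', 'LOCATION'})):
    """Join adjacent words whose tag is in `mergeable`; keep the rest separate."""
    def key(item):
        index, (word, tag) = item
        # Mergeable tags share a key (the tag itself); every other item
        # gets its own unique key (its index), so it stays on its own.
        return tag if tag in mergeable else index

    return [
        ' '.join(word for index, (word, tag) in group)
        for _, group in groupby(enumerate(tagged_text), key)
    ]

classified_text = [
    ('Barak', 'PERSON'), ('Obama', 'PERSON'), ('is', 'O'), ('the', 'O'),
    ('president', 'O'), ('of', 'O'), ('United', 'LOCATION'),
    ('States', 'LOCATION'), ('of', 'LOCATION'), ('America', 'LOCATION'),
]

# Merge only PERSON words; LOCATION words are left as separate items.
persons_only = merge_tags(classified_text, mergeable={'PERSON'})
print(persons_only)
# ['Barak Obama', 'is', 'the', 'president', 'of', 'United', 'States', 'of', 'America']
```

One caveat that applies to any grouping by tag: two different entities with the same tag that happen to sit next to each other will be joined into one string.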

This is the first thing that came to my mind. I'm pretty sure it could be optimised, but it's a good start.

    classified_text = [('Barak', 'PERSON'), ('Obama', 'PERSON'), ('is', 'O'), ('the', 'O'), ('president', 'O'), ('of', 'O'), ('United', 'LOCATION'), ('States', 'LOCATION'), ('of', 'LOCATION'), ('America', 'LOCATION')]

    # Reverse the list so that pop() returns the first element
    classified_text.reverse()
    # Create an aux list to store the result and add the first item
    new_text = [classified_text.pop(), ]
    # Iterate over the text
    while classified_text:
        old_word = new_text[-1]
        new_word = classified_text.pop()

        # If previous word has same type, merge.
        # Avoid merging 'O' types
        if old_word[1] == new_word[1] and new_word[1] != 'O':
            new_text[-1] = (
                ' '.join((old_word[0], new_word[0])),
                new_word[1],
            )
        # If not just add the tuple
        else:
            new_text.append(new_word)

    # Remove the types from the list and you have your result
    new_text = [x[0] for x in new_text]
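
Note that the loop above consumes classified_text with pop(), so the list is empty afterwards. If the original list is still needed, the same idea can be written as a function that leaves its input alone. This is a sketch under that assumption; the names merge_adjacent and skip_tag are hypothetical.

```python
def merge_adjacent(tagged_text, skip_tag='O'):
    """Merge adjacent words that share a tag (except `skip_tag`).

    The input list is only read, never modified.
    """
    merged = []
    for word, tag in tagged_text:
        if merged and merged[-1][1] == tag and tag != skip_tag:
            # Same tag as the previous entry: extend that entry.
            merged[-1] = (merged[-1][0] + ' ' + word, tag)
        else:
            # Different tag, or a tag we never merge: start a new entry.
            merged.append((word, tag))
    # Drop the tags and keep only the (possibly merged) words.
    return [word for word, tag in merged]

classified_text = [
    ('Barak', 'PERSON'), ('Obama', 'PERSON'), ('is', 'O'), ('the', 'O'),
    ('president', 'O'), ('of', 'O'), ('United', 'LOCATION'),
    ('States', 'LOCATION'), ('of', 'LOCATION'), ('America', 'LOCATION'),
]

result = merge_adjacent(classified_text)
print(result)
# ['Barak Obama', 'is', 'the', 'president', 'of', 'United States of America']
```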

