Merge the first words in a list of word-pairs, depending on the second words in those pairs

I have a program (NLTK-NER) which provides me with this list:

[
    ('Barak', 'PERSON'),
    ('Obama', 'PERSON'),
    ('is', 'O'),
    ('the', 'O'),
    ('president', 'O'),
    ('of', 'O'),
    ('United', 'LOCATION'),
    ('States', 'LOCATION'),
    ('of', 'LOCATION'),
    ('America', 'LOCATION')
]

As you can see, "Barak" and "Obama" are both tagged "PERSON". I want to merge them into one entry (and likewise the consecutive words tagged "LOCATION"), like this:

['Barak Obama','is','the','president', 'of','United States of America']

How can I approach this problem?

What we're looking to do here, essentially, is group some items of classified_text together, so it stands to reason that itertools.groupby() can help. First of all, we need a key function that treats adjacent items sharing the tag 'PERSON' (or sharing the tag 'LOCATION') as similar, and all other items as distinct.
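As a quick illustration of what groupby() does with the simplest possible key (the tag alone), here is a minimal sketch. It assumes classified_text holds the list from the question; the name naive is only used for this illustration.

```python
from itertools import groupby

# The tagged list from the question.
classified_text = [
    ('Barak', 'PERSON'), ('Obama', 'PERSON'),
    ('is', 'O'), ('the', 'O'), ('president', 'O'), ('of', 'O'),
    ('United', 'LOCATION'), ('States', 'LOCATION'),
    ('of', 'LOCATION'), ('America', 'LOCATION'),
]

# Naive key: group consecutive items on the tag alone.
naive = [
    ' '.join(word for word, tag in group)
    for key, group in groupby(classified_text, key=lambda item: item[1])
]

print(naive)
# ['Barak Obama', 'is the president of', 'United States of America']
```

The PERSON and LOCATION runs come out as wanted, but the run of 'O' words is glued together as well, which is not what the question asks for.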

This is slightly complicated by the fact that we need a way to distinguish adjacent items that have the same tag (other than 'PERSON' or 'LOCATION'), e.g. ('is', 'O'), ('the', 'O') etc. We can use enumerate() for that, since it pairs every item with a unique index:

>>> list(enumerate(classified_text))
[..., (2, ('is', 'O')), (3, ('the', 'O')), (4, ('president', 'O')), ...]

Now that we know what we're going to provide as input to groupby(), we can write our key function:

def person_or_location(item):
    index, (word, tag) = item
    if tag in {'PERSON', 'LOCATION'}:
        return tag
    else:
        return index

Notice that the structure of index, (word, tag) in the assignment matches the structure of each item in our enumerated list.

Once we've got that, we can write another function to do the actual merging:

from itertools import groupby

def merge(tagged_text):
    enumerated_text = enumerate(tagged_text)
    grouped_text = groupby(enumerated_text, person_or_location)
    return [
        ' '.join(word for index, (word, tag) in group)
        for key, group in grouped_text
    ]

Here it is in action:

>>> merge(classified_text)
['Barak Obama', 'is', 'the', 'president', 'of', 'United States of America']

This is the first thing that came to my mind. I'm pretty sure it could be optimised, but it is a good start.

    classified_text = [('Barak', 'PERSON'), ('Obama', 'PERSON'), ('is', 'O'), ('the', 'O'), ('president', 'O'), ('of', 'O'), ('United', 'LOCATION'), ('States', 'LOCATION'), ('of', 'LOCATION'), ('America', 'LOCATION')]

    # Reverse the list in place so that pop() returns the original first element
    classified_text.reverse()
    # Create an aux list to store the result and add the first item
    new_text = [classified_text.pop(), ]
    # Iterate over the text
    while classified_text:
        old_word = new_text[-1]
        new_word = classified_text.pop()

        # If previous word has same type, merge. 
        # Avoid merging 'O' types
        if old_word[1] == new_word[1] and new_word[1] != 'O':
            new_text[-1] = (
                ' '.join((old_word[0], new_word[0])),
                new_word[1],
            )

        # If not just add the tuple
        else:
            new_text.append(new_word)

    # Remove the types from the list and you have your result
    new_text = [x[0] for x in new_text]
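The same idea can be sketched as a function that walks the list once and leaves the input untouched (the loop above empties classified_text as it goes). This is only a sketch: the name merge_entities and the skip_tag parameter are made up for illustration. Like the loop above, it merges consecutive items with any identical tag other than 'O', not only 'PERSON' and 'LOCATION'.

```python
def merge_entities(tagged_text, skip_tag='O'):
    """Merge consecutive words that share a tag, except for skip_tag.

    merge_entities and skip_tag are hypothetical names for this sketch.
    """
    merged = []  # list of (text, tag) tuples built so far
    for word, tag in tagged_text:
        # Extend the previous entry when the tag repeats and is not skip_tag.
        if merged and tag != skip_tag and merged[-1][1] == tag:
            previous_text = merged[-1][0]
            merged[-1] = (previous_text + ' ' + word, tag)
        else:
            merged.append((word, tag))
    # Drop the tags and keep only the (possibly merged) text.
    return [text for text, tag in merged]


classified_text = [
    ('Barak', 'PERSON'), ('Obama', 'PERSON'),
    ('is', 'O'), ('the', 'O'), ('president', 'O'), ('of', 'O'),
    ('United', 'LOCATION'), ('States', 'LOCATION'),
    ('of', 'LOCATION'), ('America', 'LOCATION'),
]

print(merge_entities(classified_text))
# ['Barak Obama', 'is', 'the', 'president', 'of', 'United States of America']
```

Because the function only reads tagged_text, it also works on an empty list and on any iterable of (word, tag) pairs.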
