Merge the first words in a list of word pairs, depending on the second words in those pairs

Question

I have a program (NLTK-NER) that gives me the following list:

[
    ('Barak', 'PERSON'),
    ('Obama', 'PERSON'),
    ('is', 'O'),
    ('the', 'O'),
    ('president', 'O'),
    ('of', 'O'),
    ('United', 'LOCATION'),
    ('States', 'LOCATION'),
    ('of', 'LOCATION'),
    ('America', 'LOCATION')
]

As you can see, 'Barak' and 'Obama' are both words of type 'PERSON'. I want to merge them (and likewise the words of type 'LOCATION') together, like this:

['Barak Obama','is','the','president', 'of','United States of America']

How can I solve this?

Answer 1

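Essentially, what we want to do here is group some items of classified_text together, so it stands to reason that itertools.groupby() can help. First, we need a key function that treats items tagged 'PERSON' or 'LOCATION' as similar, and all other items as distinct.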

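We also need a way to tell apart adjacent items that share a tag other than 'PERSON' or 'LOCATION', e.g. ('is', 'O'), ('the', 'O') and so on. Otherwise groupby() would lump them together. We can use enumerate() for this, since it gives every item a unique index: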

>>> list(enumerate(classified_text))
[..., (2, ('is', 'O')), (3, ('the', 'O')), (4, ('president', 'O')), ...]

Now that we know what to pass to groupby() as input, we can write the key function:

def person_or_location(item):
    index, (word, tag) = item
    if tag in {'PERSON', 'LOCATION'}:
        return tag
    else:
        return index

Note that the structure of index, (word, tag) in the assignment matches the structure of each item in our enumerated list.
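
To make the effect of the key function concrete, here is a small sketch that computes the key for every enumerated item. The sample data is the list from the question; the variable name keys is only for illustration.

```python
# Sample data copied from the question.
classified_text = [
    ('Barak', 'PERSON'), ('Obama', 'PERSON'), ('is', 'O'), ('the', 'O'),
    ('president', 'O'), ('of', 'O'), ('United', 'LOCATION'),
    ('States', 'LOCATION'), ('of', 'LOCATION'), ('America', 'LOCATION'),
]

def person_or_location(item):
    # Each item comes from enumerate(): (index, (word, tag)).
    index, (word, tag) = item
    if tag in {'PERSON', 'LOCATION'}:
        return tag    # shared key, so adjacent items end up in one group
    return index      # unique key, so the item stays in a group of its own

keys = [person_or_location(item) for item in enumerate(classified_text)]
print(keys)
# ['PERSON', 'PERSON', 2, 3, 4, 5, 'LOCATION', 'LOCATION', 'LOCATION', 'LOCATION']
```

Runs of equal keys are exactly what groupby() collapses into a single group, which is why the 'O' items stay separate while the 'PERSON' and 'LOCATION' items are grouped.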

With that in place, we can write another function to do the actual merging:

from itertools import groupby

def merge(tagged_text):
    enumerated_text = enumerate(tagged_text)
    grouped_text = groupby(enumerated_text, person_or_location)
    return [
        ' '.join(word for index, (word, tag) in group)
        for key, group in grouped_text
    ]

Here it is in action:

>>> merge(classified_text)
['Barak Obama', 'is', 'the', 'president', 'of', 'United States of America']
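
If the entity type is still needed after merging, one possible variation is to return (text, tag) pairs instead of bare strings. This is only a sketch: merge_with_tags is a hypothetical name that does not appear in the answer above, and the shortened sample list is made up for the demonstration.

```python
from itertools import groupby

def person_or_location(item):
    # Same key function as above, written compactly.
    index, (word, tag) = item
    return tag if tag in {'PERSON', 'LOCATION'} else index

def merge_with_tags(tagged_text):
    """Hypothetical variant of merge() that keeps each group's tag."""
    result = []
    for key, group in groupby(enumerate(tagged_text), person_or_location):
        pairs = [pair for index, pair in group]      # drop the indices
        text = ' '.join(word for word, tag in pairs)
        result.append((text, pairs[0][1]))           # all tags in a group are equal
    return result

sample = [('Barak', 'PERSON'), ('Obama', 'PERSON'), ('is', 'O'),
          ('United', 'LOCATION'), ('States', 'LOCATION')]
print(merge_with_tags(sample))
# [('Barak Obama', 'PERSON'), ('is', 'O'), ('United States', 'LOCATION')]
```

One caveat applies to both versions: two different entities of the same type that happen to be adjacent (for example two person names in a row) share a key and are merged into a single string.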

Answer 2

This is the first thing that came to mind. I'm pretty sure it can be optimized, but it's a good start.

    classified_text = [('Barak', 'PERSON'), ('Obama', 'PERSON'), ('is', 'O'), ('the', 'O'), ('president', 'O'), ('of', 'O'), ('United', 'LOCATION'), ('States', 'LOCATION'), ('of', 'LOCATION'), ('America', 'LOCATION')]

    # Reverse the list so it pops the first element
    classified_text.reverse()
    # Create an aux list to store the result and add the first item
    new_text = [classified_text.pop(), ]
    # Iterate over the text
    while classified_text:
        old_word = new_text[-1]
        new_word = classified_text.pop()

        # If previous word has same type, merge. 
        # Avoid merging 'O' types
        if old_word[1] == new_word[1] and new_word[1] != 'O':
            new_text[-1] = (
                ' '.join((old_word[0], new_word[0])),
                new_word[1],
            )

        # If not just add the tuple
        else:
            new_text.append(new_word)

    # Remove the types from the list and you have your result
    new_text = [x[0] for x in new_text]
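
The same idea can be wrapped in a function that iterates forward and leaves the input list untouched (the loop above empties classified_text as it goes). This is a minimal sketch, not the answer's own code: merge_adjacent and its skip_tag parameter are hypothetical names introduced here.

```python
def merge_adjacent(tagged_text, skip_tag='O'):
    """Merge runs of adjacent words that share a tag, except for skip_tag."""
    merged = []
    for word, tag in tagged_text:
        # Extend the previous entry only if it has the same, mergeable tag.
        if merged and tag != skip_tag and merged[-1][1] == tag:
            previous_word, _ = merged[-1]
            merged[-1] = (previous_word + ' ' + word, tag)
        else:
            merged.append((word, tag))
    # Drop the tags and keep only the merged words.
    return [word for word, tag in merged]

classified_text = [
    ('Barak', 'PERSON'), ('Obama', 'PERSON'), ('is', 'O'), ('the', 'O'),
    ('president', 'O'), ('of', 'O'), ('United', 'LOCATION'),
    ('States', 'LOCATION'), ('of', 'LOCATION'), ('America', 'LOCATION'),
]
print(merge_adjacent(classified_text))
# ['Barak Obama', 'is', 'the', 'president', 'of', 'United States of America']
```

Unlike the groupby() approach in Answer 1, which merges only 'PERSON' and 'LOCATION', this merges runs of any tag other than 'O', so a tag such as 'ORGANIZATION' would also be combined.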
