NLTK relation extraction returns nothing

Question

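I have recently been working on extracting relations from text with NLTK. I built a sample text, "Tom is the cofounder of Microsoft.", and tested it with the following program, which returns nothing. I cannot figure out why.

I am using NLTK version 3.2.1 and Python version 3.5.2.

Here is my code: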

import re
import nltk
from nltk.sem.relextract import extract_rels, rtuple
from nltk.tokenize import sent_tokenize, word_tokenize


def test():
    with open('sample.txt', 'r') as f:
        sample = f.read()   # "Tom is the cofounder of Microsoft"

    sentences = sent_tokenize(sample)
    tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]
    tagged_sentences = [nltk.tag.pos_tag(sentence) for sentence in tokenized_sentences]

    OF = re.compile(r'.*\bof\b.*')

    for i, sent in enumerate(tagged_sentences):
        sent = nltk.chunk.ne_chunk(sent) # ne_chunk method expects one tagged sentence
        rels = extract_rels('PER', 'GPE', sent, corpus='ace', pattern=OF, window=10) 
        for rel in rels:
            print('{0:<5}{1}'.format(i, rtuple(rel)))

if __name__ == '__main__':
    test()

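1. After some debugging, I found that when I changed the input to

"Gates was born in Seattle, Washington on October 28, 1955."

the nltk.chunk.ne_chunk() output is:

(S (PERSON Gates/NNS) was/VBD born/VBN in/IN (GPE Seattle/NNP) ,/, (GPE Washington/NNP) on/IN October/NNP 28/CD ,/, 1955/CD ./.)

and test() returns:

[PER: 'Gates/NNS'] 'was/VBD born/VBN in/IN' [GPE: 'Seattle/NNP']

2. After I changed the input to:

"Gates was born in Seattle on October 28, 1955."

test() returns nothing.

I dug into nltk/sem/relextract.py and found something strange:

The output is determined by the function semi_rel2reldict(pairs, window=5, trace=False), which produces results only when len(pairs) > 2. That is why a sentence with fewer than three NEs returns nothing.

Is this a bug, or am I using NLTK incorrectly?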

Answer 1

First, to chunk NEs with ne_chunk, the idiom looks something like this:

>>> from nltk import ne_chunk, pos_tag, word_tokenize
>>> text = "Tom is the cofounder of Microsoft"
>>> chunked = ne_chunk(pos_tag(word_tokenize(text)))
>>> chunked
Tree('S', [Tree('PERSON', [('Tom', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN'), Tree('ORGANIZATION', [('Microsoft', 'NNP')])])

(See also https://stackoverflow.com/a/31838373/610569 )

Next, let's look at the extract_rels function.

def extract_rels(subjclass, objclass, doc, corpus='ace', pattern=None, window=10):
    """
    Filter the output of ``semi_rel2reldict`` according to specified NE classes and a filler pattern.
    The parameters ``subjclass`` and ``objclass`` can be used to restrict the
    Named Entities to particular types (any of 'LOCATION', 'ORGANIZATION',
    'PERSON', 'DURATION', 'DATE', 'CARDINAL', 'PERCENT', 'MONEY', 'MEASURE').
    """

When you invoke this function:

extract_rels('PER', 'GPE', sent, corpus='ace', pattern=OF, window=10)

It performs 4 steps in sequence.

1. It checks whether your `subjclass` and `objclass` are valid

i.e. https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L202 :

if subjclass and subjclass not in NE_CLASSES[corpus]:
    if _expand(subjclass) in NE_CLASSES[corpus]:
        subjclass = _expand(subjclass)
    else:
        raise ValueError("your value for the subject type has not been recognized: %s" % subjclass)
if objclass and objclass not in NE_CLASSES[corpus]:
    if _expand(objclass) in NE_CLASSES[corpus]:
        objclass = _expand(objclass)
    else:
        raise ValueError("your value for the object type has not been recognized: %s" % objclass)
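The check above can be mimicked without NLTK. The following is a minimal sketch, not the library's code: the tables KNOWN_CLASSES and SHORT_TO_LONG and the function normalize_label are hypothetical stand-ins, and their contents are assumed for illustration. It shows why a short label such as 'PER' is accepted for corpus='ace': the label itself is not in the class list, but its expanded form is.

```python
# Hypothetical stand-ins for NE_CLASSES and _expand; the contents are assumptions.
KNOWN_CLASSES = {
    "ace": ["LOCATION", "ORGANIZATION", "PERSON", "FACILITY", "GPE"],
    "conll2002": ["LOC", "PER", "ORG"],
}
SHORT_TO_LONG = {"LOC": "LOCATION", "ORG": "ORGANIZATION", "PER": "PERSON"}


def normalize_label(label, corpus):
    """Return a label valid for `corpus`, expanding short forms if needed."""
    if label in KNOWN_CLASSES[corpus]:
        return label
    expanded = SHORT_TO_LONG.get(label, label)
    if expanded in KNOWN_CLASSES[corpus]:
        return expanded
    raise ValueError("label not recognized: %s" % label)


print(normalize_label("PER", "ace"))  # short form is expanded
print(normalize_label("GPE", "ace"))  # already a valid class, returned unchanged
```

Note that 'GPE' is a valid class in the question's call (the Gates example produced a GPE relation), so the call does not fail at this step. It simply asks for a different object class than the ORGANIZATION label that ne_chunk assigned to Microsoft.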

2. It extracts "pairs" from your NE-tagged input:

if corpus == 'ace' or corpus == 'conll2002':
    pairs = tree2semi_rel(doc)
elif corpus == 'ieer':
    pairs = tree2semi_rel(doc.text) + tree2semi_rel(doc.headline)
else:
    raise ValueError("corpus type not recognized")

Now let's see what tree2semi_rel() returns for your input sentence, Tom is the cofounder of Microsoft:

>>> from nltk.sem.relextract import tree2semi_rel, semi_rel2reldict
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> text = "Tom is the cofounder of Microsoft"
>>> chunked = ne_chunk(pos_tag(word_tokenize(text)))
>>> tree2semi_rel(chunked)
[[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])]]

So it returns a list of 2 lists. The first inner list consists of an empty list and the Tree that carries the "PERSON" tag.

[[], Tree('PERSON', [('Tom', 'NNP')])]

The second inner list consists of the phrase is the cofounder of and the Tree that carries the "ORGANIZATION" tag.
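How such pairs come about can be sketched in plain Python. This is an illustration under assumptions, not NLTK's implementation: Entity is a hypothetical stand-in for a labelled Tree node and to_pairs is a hypothetical name. Each entity closes off a pair whose first element is the run of ordinary tokens that preceded it.

```python
from typing import List, NamedTuple, Tuple, Union

Token = Tuple[str, str]  # (word, POS tag)


class Entity(NamedTuple):
    """Hypothetical stand-in for an NE subtree such as Tree('PERSON', ...)."""
    label: str
    tokens: List[Token]


def to_pairs(sentence: List[Union[Token, Entity]]) -> list:
    """Group a chunked sentence into [filler_tokens, entity] pairs."""
    pairs = []
    filler = []
    for item in sentence:
        if isinstance(item, Entity):
            pairs.append([filler, item])
            filler = []  # start collecting filler for the next entity
        else:
            filler.append(item)
    return pairs


sentence = [
    Entity("PERSON", [("Tom", "NNP")]),
    ("is", "VBZ"), ("the", "DT"), ("cofounder", "NN"), ("of", "IN"),
    Entity("ORGANIZATION", [("Microsoft", "NNP")]),
]
pairs = to_pairs(sentence)
print(len(pairs))   # one pair per entity
print(pairs[0][0])  # filler before the first entity (empty here)
print(pairs[1][0])  # filler between the two entities
```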

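Let's move on.

3. `extract_rels` then tries to convert the pairs into some sort of relation dictionary

reldicts = semi_rel2reldict(pairs)

If we look at what the semi_rel2reldict function returns for your example sentence, we see that this is where the empty list comes from: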

>>> tree2semi_rel(chunked)
[[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])]]
>>> semi_rel2reldict(tree2semi_rel(chunked))
[]

So let's look at the code of semi_rel2reldict, https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L144 :

def semi_rel2reldict(pairs, window=5, trace=False):
    """
    Converts the pairs generated by ``tree2semi_rel`` into a 'reldict': a dictionary which
    stores information about the subject and object NEs plus the filler between them.
    Additionally, a left and right context of length =< window are captured (within
    a given input sentence).
    :param pairs: a pair of list(str) and ``Tree``, as generated by
    :param window: a threshold for the number of items to include in the left and right context
    :type window: int
    :return: 'relation' dictionaries whose keys are 'lcon', 'subjclass', 'subjtext', 'subjsym', 'filler', objclass', objtext', 'objsym' and 'rcon'
    :rtype: list(defaultdict)
    """
    result = []
    while len(pairs) > 2:
        reldict = defaultdict(str)
        reldict['lcon'] = _join(pairs[0][0][-window:])
        reldict['subjclass'] = pairs[0][1].label()
        reldict['subjtext'] = _join(pairs[0][1].leaves())
        reldict['subjsym'] = list2sym(pairs[0][1].leaves())
        reldict['filler'] = _join(pairs[1][0])
        reldict['untagged_filler'] = _join(pairs[1][0], untag=True)
        reldict['objclass'] = pairs[1][1].label()
        reldict['objtext'] = _join(pairs[1][1].leaves())
        reldict['objsym'] = list2sym(pairs[1][1].leaves())
        reldict['rcon'] = _join(pairs[2][0][:window])
        if trace:
            print("(%s(%s, %s)" % (reldict['untagged_filler'], reldict['subjclass'], reldict['objclass']))
        result.append(reldict)
        pairs = pairs[1:]
    return result

The first thing semi_rel2reldict() does is check whether the output of tree2semi_rel() has more than 2 elements, which is not the case for your example sentence:

>>> tree2semi_rel(chunked)
[[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])]]
>>> len(tree2semi_rel(chunked))
2
>>> len(tree2semi_rel(chunked)) > 2
False

Aha, that is why extract_rels returns nothing.
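The effect of the loop condition can be reproduced with placeholder pairs. This is a minimal sketch that assumes only the loop structure shown above; count_relations is a hypothetical name and the lists hold plain strings instead of real pairs.

```python
def count_relations(pairs):
    """Count how many reldicts a `while len(pairs) > 2` loop would build."""
    count = 0
    while len(pairs) > 2:
        count += 1
        pairs = pairs[1:]  # slide forward by one entity, as in the original loop
    return count


two_entities = ["Tom", "Microsoft"]
three_entities = ["Gates", "Seattle", "Washington"]

print(count_relations(two_entities))    # 0
print(count_relations(three_entities))  # 1
```

This matches the behaviour reported in the question: the sentence containing Gates, Seattle and Washington yields one relation, while the sentence containing only Gates and Seattle yields none.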

Now the question is how to make extract_rels() return something even when tree2semi_rel() yields only 2 elements. Is that even possible?

Let's try a different sentence:

>>> text = "Tom is the cofounder of Microsoft and now he is the founder of Marcohard"
>>> chunked = ne_chunk(pos_tag(word_tokenize(text)))
>>> chunked
Tree('S', [Tree('PERSON', [('Tom', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN'), Tree('ORGANIZATION', [('Microsoft', 'NNP')]), ('and', 'CC'), ('now', 'RB'), ('he', 'PRP'), ('is', 'VBZ'), ('the', 'DT'), ('founder', 'NN'), ('of', 'IN'), Tree('PERSON', [('Marcohard', 'NNP')])])
>>> tree2semi_rel(chunked)
[[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])], [[('and', 'CC'), ('now', 'RB'), ('he', 'PRP'), ('is', 'VBZ'), ('the', 'DT'), ('founder', 'NN'), ('of', 'IN')], Tree('PERSON', [('Marcohard', 'NNP')])]]
>>> len(tree2semi_rel(chunked)) > 2
True
>>> semi_rel2reldict(tree2semi_rel(chunked))
[defaultdict(<type 'str'>, {'lcon': '', 'untagged_filler': 'is the cofounder of', 'filler': 'is/VBZ the/DT cofounder/NN of/IN', 'objsym': 'microsoft', 'objclass': 'ORGANIZATION', 'objtext': 'Microsoft/NNP', 'subjsym': 'tom', 'subjclass': 'PERSON', 'rcon': 'and/CC now/RB he/PRP is/VBZ the/DT', 'subjtext': 'Tom/NNP'})]

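But that only confirms that extract_rels cannot extract anything when tree2semi_rel returns fewer than 3 pairs. What happens if we remove the while len(pairs) > 2 condition?

Why can't we use while len(pairs) > 1?

If we look closer at the code, we see the last line that populates the reldict, https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L169 :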

reldict['rcon'] = _join(pairs[2][0][:window])

It tries to access the third element of pairs, so if the length of pairs is 2 you will get an IndexError.
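The failure, and one possible guard against it, can be sketched on plain lists. The helper right_context is hypothetical and not part of NLTK; it only illustrates that the right context depends on a third pair being present.

```python
def right_context(pairs, window=5):
    """Return up to `window` filler tokens that follow the object entity.

    pairs[2][0] is the filler preceding the third entity. When there is no
    third pair, fall back to an empty context instead of raising IndexError.
    """
    if len(pairs) > 2:
        return pairs[2][0][:window]
    return []


two_pairs = [[[], "Tom"], [["is", "the", "cofounder", "of"], "Microsoft"]]
three_pairs = two_pairs + [
    [["and", "now", "he", "is", "the", "founder", "of"], "Marcohard"]
]

try:
    two_pairs[2]  # what the unguarded lookup would attempt
except IndexError as exc:
    print("IndexError:", exc)

print(right_context(two_pairs))    # []
print(right_context(three_pairs))  # ['and', 'now', 'he', 'is', 'the']
```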

So what happens if we drop the lookup for that rcon key and simply change the condition to while len(pairs) >= 2?

To do that, we have to override the semi_rel2reldict() function:

>>> from nltk.sem.relextract import _join, list2sym
>>> from collections import defaultdict
>>> def semi_rel2reldict(pairs, window=5, trace=False):
...     """
...     Converts the pairs generated by ``tree2semi_rel`` into a 'reldict': a dictionary which
...     stores information about the subject and object NEs plus the filler between them.
...     Additionally, a left and right context of length =< window are captured (within
...     a given input sentence).
...     :param pairs: a pair of list(str) and ``Tree``, as generated by
...     :param window: a threshold for the number of items to include in the left and right context
...     :type window: int
...     :return: 'relation' dictionaries whose keys are 'lcon', 'subjclass', 'subjtext', 'subjsym', 'filler', objclass', objtext', 'objsym' and 'rcon'
...     :rtype: list(defaultdict)
...     """
...     result = []
...     while len(pairs) >= 2:
...         reldict = defaultdict(str)
...         reldict['lcon'] = _join(pairs[0][0][-window:])
...         reldict['subjclass'] = pairs[0][1].label()
...         reldict['subjtext'] = _join(pairs[0][1].leaves())
...         reldict['subjsym'] = list2sym(pairs[0][1].leaves())
...         reldict['filler'] = _join(pairs[1][0])
...         reldict['untagged_filler'] = _join(pairs[1][0], untag=True)
...         reldict['objclass'] = pairs[1][1].label()
...         reldict['objtext'] = _join(pairs[1][1].leaves())
...         reldict['objsym'] = list2sym(pairs[1][1].leaves())
...         reldict['rcon'] = []
...         if trace:
...             print("(%s(%s, %s)" % (reldict['untagged_filler'], reldict['subjclass'], reldict['objclass']))
...         result.append(reldict)
...         pairs = pairs[1:]
...     return result
... 
>>> text = "Tom is the cofounder of Microsoft"
>>> chunked = ne_chunk(pos_tag(word_tokenize(text)))
>>> tree2semi_rel(chunked)
[[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])]]
>>> semi_rel2reldict(tree2semi_rel(chunked))
[defaultdict(<type 'str'>, {'lcon': '', 'untagged_filler': 'is the cofounder of', 'filler': 'is/VBZ the/DT cofounder/NN of/IN', 'objsym': 'microsoft', 'objclass': 'ORGANIZATION', 'objtext': 'Microsoft/NNP', 'subjsym': 'tom', 'subjclass': 'PERSON', 'rcon': [], 'subjtext': 'Tom/NNP'})]

Ah! It works, but there is still a 4th step in extract_rels().

4. It filters the reldicts using the regex you passed to the `pattern` parameter, https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L222 :

relfilter = lambda x: (x['subjclass'] == subjclass and
                       len(x['filler'].split()) <= window and
                       pattern.match(x['filler']) and
                       x['objclass'] == objclass)
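The filter can be exercised on a plain dictionary. This is a minimal sketch that assumes a reldict holding only the three keys the filter reads; make_filter is a hypothetical helper, not an NLTK function. It shows that the same reldict passes for object class ORGANIZATION and fails for GPE.

```python
import re


def make_filter(subjclass, objclass, pattern, window=10):
    """Build a predicate with the same four conditions as the lambda above."""
    def relfilter(reldict):
        return bool(
            reldict["subjclass"] == subjclass
            and len(reldict["filler"].split()) <= window
            and pattern.match(reldict["filler"])
            and reldict["objclass"] == objclass
        )
    return relfilter


OF = re.compile(r".*\bof\b.*")
reldict = {
    "subjclass": "PERSON",
    "filler": "is/VBZ the/DT cofounder/NN of/IN",
    "objclass": "ORGANIZATION",
}

print(make_filter("PERSON", "ORGANIZATION", OF)(reldict))  # True
print(make_filter("PERSON", "GPE", OF)(reldict))           # False
```

With the object class 'GPE' from the question's call, this relation would still be filtered out even after changing semi_rel2reldict, because ne_chunk labelled Microsoft as ORGANIZATION.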

Now let's try it with the hacked version of semi_rel2reldict:

>>> text = "Tom is the cofounder of Microsoft"
>>> chunked = ne_chunk(pos_tag(word_tokenize(text)))
>>> tree2semi_rel(chunked)
[[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])]]
>>> semi_rel2reldict(tree2semi_rel(chunked))
[defaultdict(<type 'str'>, {'lcon': '', 'untagged_filler': 'is the cofounder of', 'filler': 'is/VBZ the/DT cofounder/NN of/IN', 'objsym': 'microsoft', 'objclass': 'ORGANIZATION', 'objtext': 'Microsoft/NNP', 'subjsym': 'tom', 'subjclass': 'PERSON', 'rcon': [], 'subjtext': 'Tom/NNP'})]
>>> 
>>> import re
>>> pattern = re.compile(r'.*\bof\b.*')
>>> reldicts = semi_rel2reldict(tree2semi_rel(chunked))
>>> relfilter = lambda x: (x['subjclass'] == subjclass and
...                            len(x['filler'].split()) <= window and
...                            pattern.match(x['filler']) and
...                            x['objclass'] == objclass)
>>> relfilter
<function <lambda> at 0x112e591b8>
>>> subjclass = 'PERSON'
>>> objclass = 'ORGANIZATION'
>>> window = 5
>>> list(filter(relfilter, reldicts))
[defaultdict(<type 'str'>, {'lcon': '', 'untagged_filler': 'is the cofounder of', 'filler': 'is/VBZ the/DT cofounder/NN of/IN', 'objsym': 'microsoft', 'objclass': 'ORGANIZATION', 'objtext': 'Microsoft/NNP', 'subjsym': 'tom', 'subjclass': 'PERSON', 'rcon': [], 'subjtext': 'Tom/NNP'})]

It works! Now let's see it in tuple form:

>>> from nltk.sem.relextract import rtuple
>>> rels = list(filter(relfilter, reldicts))
>>> for rel in rels:
...     print rtuple(rel)
... 
[PER: 'Tom/NNP'] 'is/VBZ the/DT cofounder/NN of/IN' [ORG: 'Microsoft/NNP']
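The shape of that output line can be reproduced from a plain dictionary. This is a rough sketch only: format_relation and the ABBREV table are hypothetical, and the format string is inferred from the output shown above rather than taken from NLTK's rtuple.

```python
# Assumed abbreviation table, chosen to match the labels in the output above.
ABBREV = {"PERSON": "PER", "ORGANIZATION": "ORG", "LOCATION": "LOC"}


def format_relation(reldict):
    """Render a reldict as "[SUBJ: 'text'] 'filler' [OBJ: 'text']"."""
    subj = ABBREV.get(reldict["subjclass"], reldict["subjclass"])
    obj = ABBREV.get(reldict["objclass"], reldict["objclass"])
    return "[%s: %r] %r [%s: %r]" % (
        subj, reldict["subjtext"], reldict["filler"], obj, reldict["objtext"]
    )


example = {
    "subjclass": "PERSON",
    "subjtext": "Tom/NNP",
    "filler": "is/VBZ the/DT cofounder/NN of/IN",
    "objclass": "ORGANIZATION",
    "objtext": "Microsoft/NNP",
}
print(format_relation(example))
```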

Answer 2

alvas's solution works very well! One small modification, though: instead of writing

>>> for rel in rels:
...     print rtuple(rel)

please use

>>> for rel in rels:
...    print (rtuple(rel))

- (posted as an answer because I was unable to add a comment)

Answer 1: score 6, accepted, 2016-11-09 01:11:02

Answer 2: score -1, 2019-03-07 13:08:24
