
NLTK relation extraction returns nothing

I have recently been working on using nltk to extract relations from text. So I built a sample text: "Tom is the cofounder of Microsoft." and used the following program to test it, but it returns nothing. I cannot figure out why.

I'm using NLTK version 3.2.1 and Python version 3.5.2.

Here is my code:

import re
import nltk
from nltk.sem.relextract import extract_rels, rtuple
from nltk.tokenize import sent_tokenize, word_tokenize


def test():
    with open('sample.txt', 'r') as f:
        sample = f.read()   # "Tom is the cofounder of Microsoft"

    sentences = sent_tokenize(sample)
    tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]
    tagged_sentences = [nltk.tag.pos_tag(sentence) for sentence in tokenized_sentences]

    OF = re.compile(r'.*\bof\b.*')

    for i, sent in enumerate(tagged_sentences):
        sent = nltk.chunk.ne_chunk(sent) # ne_chunk method expects one tagged sentence
        rels = extract_rels('PER', 'GPE', sent, corpus='ace', pattern=OF, window=10) 
        for rel in rels:
            print('{0:<5}{1}'.format(i, rtuple(rel)))

if __name__ == '__main__':
    test()

1. After some debugging, I found that when I changed the input to

"Gates was born in Seattle, Washington on October 28, 1955. " “盖茨于1955年10月28日出生在华盛顿州西雅图。”

the nltk.chunk.ne_chunk() output is:

(S (PERSON Gates/NNS) was/VBD born/VBN in/IN (GPE Seattle/NNP) ,/, (GPE Washington/NNP) on/IN October/NNP 28/CD ,/, 1955/CD ./.)

The test() returns:

[PER: 'Gates/NNS'] 'was/VBD born/VBN in/IN' [GPE: 'Seattle/NNP']

2. After I changed the input to:

"Gates was born in Seattle on October 28, 1955. " “盖茨于1955年10月28日出生在西雅图。”

The test() returns nothing.

3. I dug into nltk/sem/relextract.py and found that this strange output is caused by the function semi_rel2reldict(pairs, window=5, trace=False), which returns a result only when len(pairs) > 2; that is why a sentence with fewer than three NEs returns None.
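A quick way to see this at the REPL (a minimal check; it assumes ne_chunk produces the same NE tags as in the chunk output above, and relies on tree2semi_rel yielding one pair per NE subtree):

>>> from nltk import ne_chunk, pos_tag, word_tokenize
>>> from nltk.sem.relextract import tree2semi_rel
>>> len(tree2semi_rel(ne_chunk(pos_tag(word_tokenize("Gates was born in Seattle, Washington on October 28, 1955.")))))
3
>>> len(tree2semi_rel(ne_chunk(pos_tag(word_tokenize("Gates was born in Seattle on October 28, 1955.")))))
2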

Is this a bug, or am I using NLTK in the wrong way?

Firstly, to chunk NEs with ne_chunk, the idiom would look something like this:

>>> from nltk import ne_chunk, pos_tag, word_tokenize
>>> text = "Tom is the cofounder of Microsoft"
>>> chunked = ne_chunk(pos_tag(word_tokenize(text)))
>>> chunked
Tree('S', [Tree('PERSON', [('Tom', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN'), Tree('ORGANIZATION', [('Microsoft', 'NNP')])])

(see also https://stackoverflow.com/a/31838373/610569)

Next, let's look at the extract_rels function.

def extract_rels(subjclass, objclass, doc, corpus='ace', pattern=None, window=10):
    """
    Filter the output of ``semi_rel2reldict`` according to specified NE classes and a filler pattern.
    The parameters ``subjclass`` and ``objclass`` can be used to restrict the
    Named Entities to particular types (any of 'LOCATION', 'ORGANIZATION',
    'PERSON', 'DURATION', 'DATE', 'CARDINAL', 'PERCENT', 'MONEY', 'MEASURE').
    """

When you invoke this function:

extract_rels('PER', 'GPE', sent, corpus='ace', pattern=OF, window=10)

It performs 4 processes sequentially.

1. It checks whether your subjclass and objclass are valid

i.e. https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L202:

if subjclass and subjclass not in NE_CLASSES[corpus]:
    if _expand(subjclass) in NE_CLASSES[corpus]:
        subjclass = _expand(subjclass)
    else:
        raise ValueError("your value for the subject type has not been recognized: %s" % subjclass)
if objclass and objclass not in NE_CLASSES[corpus]:
    if _expand(objclass) in NE_CLASSES[corpus]:
        objclass = _expand(objclass)
    else:
        raise ValueError("your value for the object type has not been recognized: %s" % objclass)

2. It extracts "pairs" from your NE-tagged input:

if corpus == 'ace' or corpus == 'conll2002':
    pairs = tree2semi_rel(doc)
elif corpus == 'ieer':
    pairs = tree2semi_rel(doc.text) + tree2semi_rel(doc.headline)
else:
    raise ValueError("corpus type not recognized")

Now let's see, given your input sentence "Tom is the cofounder of Microsoft", what tree2semi_rel() returns:

>>> from nltk.sem.relextract import tree2semi_rel, semi_rel2reldict
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> text = "Tom is the cofounder of Microsoft"
>>> chunked = ne_chunk(pos_tag(word_tokenize(text)))
>>> tree2semi_rel(chunked)
[[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])]]

So it returns a list of 2 lists; the first inner list consists of an empty list and the Tree that contains the "PERSON" tag:

[[], Tree('PERSON', [('Tom', 'NNP')])] 

The second list consists of the filler phrase is the cofounder of and the Tree that contains "ORGANIZATION".
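Copied from the output above, that second element is:

[[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])]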

Let's move on.

3. extract_rels then tries to convert the pairs into some sort of relation dictionary:

reldicts = semi_rel2reldict(pairs)

If we look at what the semi_rel2reldict function returns with your example sentence, we see that this is where the empty list gets returned:

>>> tree2semi_rel(chunked)
[[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])]]
>>> semi_rel2reldict(tree2semi_rel(chunked))
[]

So let's look into the code of semi_rel2reldict, https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L144:

def semi_rel2reldict(pairs, window=5, trace=False):
    """
    Converts the pairs generated by ``tree2semi_rel`` into a 'reldict': a dictionary which
    stores information about the subject and object NEs plus the filler between them.
    Additionally, a left and right context of length =< window are captured (within
    a given input sentence).
    :param pairs: a pair of list(str) and ``Tree``, as generated by
    :param window: a threshold for the number of items to include in the left and right context
    :type window: int
    :return: 'relation' dictionaries whose keys are 'lcon', 'subjclass', 'subjtext', 'subjsym', 'filler', objclass', objtext', 'objsym' and 'rcon'
    :rtype: list(defaultdict)
    """
    result = []
    while len(pairs) > 2:
        reldict = defaultdict(str)
        reldict['lcon'] = _join(pairs[0][0][-window:])
        reldict['subjclass'] = pairs[0][1].label()
        reldict['subjtext'] = _join(pairs[0][1].leaves())
        reldict['subjsym'] = list2sym(pairs[0][1].leaves())
        reldict['filler'] = _join(pairs[1][0])
        reldict['untagged_filler'] = _join(pairs[1][0], untag=True)
        reldict['objclass'] = pairs[1][1].label()
        reldict['objtext'] = _join(pairs[1][1].leaves())
        reldict['objsym'] = list2sym(pairs[1][1].leaves())
        reldict['rcon'] = _join(pairs[2][0][:window])
        if trace:
            print("(%s(%s, %s)" % (reldict['untagged_filler'], reldict['subjclass'], reldict['objclass']))
        result.append(reldict)
        pairs = pairs[1:]
    return result

The first thing that semi_rel2reldict() does is check whether there are more than 2 elements in the output from tree2semi_rel(), which your example sentence doesn't have:

>>> tree2semi_rel(chunked)
[[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])]]
>>> len(tree2semi_rel(chunked))
2
>>> len(tree2semi_rel(chunked)) > 2
False

Aha, that's why extract_rels is returning nothing.

Now comes the question: how do we make extract_rels() return something even with only 2 elements from tree2semi_rel()? Is that even possible?

Let's try a different sentence:

>>> text = "Tom is the cofounder of Microsoft and now he is the founder of Marcohard"
>>> chunked = ne_chunk(pos_tag(word_tokenize(text)))
>>> chunked
Tree('S', [Tree('PERSON', [('Tom', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN'), Tree('ORGANIZATION', [('Microsoft', 'NNP')]), ('and', 'CC'), ('now', 'RB'), ('he', 'PRP'), ('is', 'VBZ'), ('the', 'DT'), ('founder', 'NN'), ('of', 'IN'), Tree('PERSON', [('Marcohard', 'NNP')])])
>>> tree2semi_rel(chunked)
[[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])], [[('and', 'CC'), ('now', 'RB'), ('he', 'PRP'), ('is', 'VBZ'), ('the', 'DT'), ('founder', 'NN'), ('of', 'IN')], Tree('PERSON', [('Marcohard', 'NNP')])]]
>>> len(tree2semi_rel(chunked)) > 2
True
>>> semi_rel2reldict(tree2semi_rel(chunked))
[defaultdict(<type 'str'>, {'lcon': '', 'untagged_filler': 'is the cofounder of', 'filler': 'is/VBZ the/DT cofounder/NN of/IN', 'objsym': 'microsoft', 'objclass': 'ORGANIZATION', 'objtext': 'Microsoft/NNP', 'subjsym': 'tom', 'subjclass': 'PERSON', 'rcon': 'and/CC now/RB he/PRP is/VBZ the/DT', 'subjtext': 'Tom/NNP'})]

But that only confirms that extract_rels can't extract anything when tree2semi_rel returns 2 or fewer pairs. What happens if we remove that while len(pairs) > 2 condition?

Why can't we do while len(pairs) > 1?

If we look closer at the code, we see the last line that populates the reldict, https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L169:

reldict['rcon'] = _join(pairs[2][0][:window])

It tries to access the 3rd element of pairs, and if the length of pairs is 2, you'll get an IndexError.
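A quick illustration with the original two-pair sentence (reusing the imports from above):

>>> pairs = tree2semi_rel(ne_chunk(pos_tag(word_tokenize("Tom is the cofounder of Microsoft"))))
>>> len(pairs)
2
>>> pairs[2][0][:5]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range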

So what happens if we remove that rcon key and simply change the condition to while len(pairs) >= 2?

To do that, we have to override the semi_rel2reldict() function:

>>> from nltk.sem.relextract import _join, list2sym
>>> from collections import defaultdict
>>> def semi_rel2reldict(pairs, window=5, trace=False):
...     """
...     Converts the pairs generated by ``tree2semi_rel`` into a 'reldict': a dictionary which
...     stores information about the subject and object NEs plus the filler between them.
...     Additionally, a left and right context of length =< window are captured (within
...     a given input sentence).
...     :param pairs: a pair of list(str) and ``Tree``, as generated by
...     :param window: a threshold for the number of items to include in the left and right context
...     :type window: int
...     :return: 'relation' dictionaries whose keys are 'lcon', 'subjclass', 'subjtext', 'subjsym', 'filler', objclass', objtext', 'objsym' and 'rcon'
...     :rtype: list(defaultdict)
...     """
...     result = []
...     while len(pairs) >= 2:
...         reldict = defaultdict(str)
...         reldict['lcon'] = _join(pairs[0][0][-window:])
...         reldict['subjclass'] = pairs[0][1].label()
...         reldict['subjtext'] = _join(pairs[0][1].leaves())
...         reldict['subjsym'] = list2sym(pairs[0][1].leaves())
...         reldict['filler'] = _join(pairs[1][0])
...         reldict['untagged_filler'] = _join(pairs[1][0], untag=True)
...         reldict['objclass'] = pairs[1][1].label()
...         reldict['objtext'] = _join(pairs[1][1].leaves())
...         reldict['objsym'] = list2sym(pairs[1][1].leaves())
...         reldict['rcon'] = []
...         if trace:
...             print("(%s(%s, %s)" % (reldict['untagged_filler'], reldict['subjclass'], reldict['objclass']))
...         result.append(reldict)
...         pairs = pairs[1:]
...     return result
... 
>>> text = "Tom is the cofounder of Microsoft"
>>> chunked = ne_chunk(pos_tag(word_tokenize(text)))
>>> tree2semi_rel(chunked)
[[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])]]
>>> semi_rel2reldict(tree2semi_rel(chunked))
[defaultdict(<type 'str'>, {'lcon': '', 'untagged_filler': 'is the cofounder of', 'filler': 'is/VBZ the/DT cofounder/NN of/IN', 'objsym': 'microsoft', 'objclass': 'ORGANIZATION', 'objtext': 'Microsoft/NNP', 'subjsym': 'tom', 'subjclass': 'PERSON', 'rcon': [], 'subjtext': 'Tom/NNP'})]

Ah! It works, but there's still a 4th step in extract_rels().

4. It filters the reldicts with the regex you provided via the pattern parameter, https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L222:

relfilter = lambda x: (x['subjclass'] == subjclass and
                       len(x['filler'].split()) <= window and
                       pattern.match(x['filler']) and
                       x['objclass'] == objclass)

Now let's try it with the hacked version of semi_rel2reldict:

>>> text = "Tom is the cofounder of Microsoft"
>>> chunked = ne_chunk(pos_tag(word_tokenize(text)))
>>> tree2semi_rel(chunked)
[[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])]]
>>> semi_rel2reldict(tree2semi_rel(chunked))
[defaultdict(<type 'str'>, {'lcon': '', 'untagged_filler': 'is the cofounder of', 'filler': 'is/VBZ the/DT cofounder/NN of/IN', 'objsym': 'microsoft', 'objclass': 'ORGANIZATION', 'objtext': 'Microsoft/NNP', 'subjsym': 'tom', 'subjclass': 'PERSON', 'rcon': [], 'subjtext': 'Tom/NNP'})]
>>> 
>>> pattern = re.compile(r'.*\bof\b.*')
>>> reldicts = semi_rel2reldict(tree2semi_rel(chunked))
>>> relfilter = lambda x: (x['subjclass'] == subjclass and
...                            len(x['filler'].split()) <= window and
...                            pattern.match(x['filler']) and
...                            x['objclass'] == objclass)
>>> relfilter
<function <lambda> at 0x112e591b8>
>>> subjclass = 'PERSON'
>>> objclass = 'ORGANIZATION'
>>> window = 5
>>> list(filter(relfilter, reldicts))
[defaultdict(<type 'str'>, {'lcon': '', 'untagged_filler': 'is the cofounder of', 'filler': 'is/VBZ the/DT cofounder/NN of/IN', 'objsym': 'microsoft', 'objclass': 'ORGANIZATION', 'objtext': 'Microsoft/NNP', 'subjsym': 'tom', 'subjclass': 'PERSON', 'rcon': [], 'subjtext': 'Tom/NNP'})]

It works! Now let's see it in tuple form:

>>> from nltk.sem.relextract import rtuple
>>> rels = list(filter(relfilter, reldicts))
>>> for rel in rels:
...     print rtuple(rel)
... 
[PER: 'Tom/NNP'] 'is/VBZ the/DT cofounder/NN of/IN' [ORG: 'Microsoft/NNP']
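Putting the pieces together, here is one way to wire the patched function back into a script like the question's test(): a hedged sketch that monkey-patches the module attribute so extract_rels picks up the relaxed semi_rel2reldict. Note that the object class has to be 'ORG' rather than 'GPE', since ne_chunk tags Microsoft as ORGANIZATION:

import re
from collections import defaultdict

import nltk.sem.relextract as relextract
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.sem.relextract import _join, list2sym, extract_rels, rtuple


def patched_semi_rel2reldict(pairs, window=5, trace=False):
    # Same as the override above: accept exactly 2 pairs and skip the 'rcon' lookup.
    result = []
    while len(pairs) >= 2:
        reldict = defaultdict(str)
        reldict['lcon'] = _join(pairs[0][0][-window:])
        reldict['subjclass'] = pairs[0][1].label()
        reldict['subjtext'] = _join(pairs[0][1].leaves())
        reldict['subjsym'] = list2sym(pairs[0][1].leaves())
        reldict['filler'] = _join(pairs[1][0])
        reldict['untagged_filler'] = _join(pairs[1][0], untag=True)
        reldict['objclass'] = pairs[1][1].label()
        reldict['objtext'] = _join(pairs[1][1].leaves())
        reldict['objsym'] = list2sym(pairs[1][1].leaves())
        reldict['rcon'] = []
        result.append(reldict)
        pairs = pairs[1:]
    return result


# extract_rels resolves semi_rel2reldict through its module globals,
# so replacing the attribute redirects the call inside extract_rels.
relextract.semi_rel2reldict = patched_semi_rel2reldict

OF = re.compile(r'.*\bof\b.*')
sent = ne_chunk(pos_tag(word_tokenize("Tom is the cofounder of Microsoft")))
for rel in extract_rels('PER', 'ORG', sent, corpus='ace', pattern=OF, window=10):
    print(rtuple(rel))
# Expected: [PER: 'Tom/NNP'] 'is/VBZ the/DT cofounder/NN of/IN' [ORG: 'Microsoft/NNP']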

alvas' solution works superbly well! One minor modification though: instead of writing

>>> for rel in rels:
...     print rtuple(rel)

please use

>>> for rel in rels:
...    print (rtuple(rel))

(Unable to add a comment.)
