
How can I fix this n-gram extractor in Python?

I made an n-gram extractor that pulls organization names out of a text. However, the program only extracts the first letter of the first word and the last word. For example, if the phrase "Sprint International Corporation" appears in the text, the program returns "s corporation" as the n-gram. Do you know what I am doing wrong? I have posted the code and the output below. Thanks.

Here is the code for the n-gram extractor.

def org_ngram(classified_text):
    orgs = [c for c in classified_text if (c[1]=="ORGANIZATION")]
    #print(orgs)

    combined_orgs = []
    prev_org = False
    new_org = ("", "ORGANIZATION")
    for i in range(len(classified_text)):
        if classified_text[i][1] != "ORGANIZATION":
            prev_org = False
        else:
            if prev_org:
                new_org = new_org[0] + " " + classified_text[i][0].lower()
            else:
                combined_orgs.append(new_org)
                new_org = classified_text[i][0].lower()
            prev_org = True

    combined_orgs.append(new_org)
    combined_orgs = combined_orgs[1:]
    return combined_orgs

Here is the text I analyzed and the program I used to analyze it.

from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

st = StanfordNERTagger('C:\\path\\english.all.3class.distsim.crf.ser.gz',
                       'C:\\Users\\path\\stanford-ner.jar',
                       encoding='utf-8')

text = "Trump met with representatives from Sprint International Corporation, Nike Inc, and Wal-Mart Company regarding the trade war."

tokenized_text = word_tokenize(text)
classified_text = st.tag(tokenized_text)
orgs = org_ngram(classified_text)

print(orgs)

This is the current output.

['s corporation', 'n inc', 'w company']

This is what I want the output to look like.

['sprint international corporation', 'nike inc', 'wal-mart company']
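
The truncation to a single letter appears to come from the expression new_org[0]: after the first token of an organization, new_org holds a plain string, so indexing it with [0] keeps only its first character. Below is a minimal sketch of a corrected extractor (the name org_ngram_fixed is only illustrative); it collects each contiguous run of ORGANIZATION tokens in a list and joins it at the end, so no characters are lost.

def org_ngram_fixed(classified_text):
    # classified_text: list of (token, tag) pairs from the NER tagger
    combined_orgs = []
    current = []  # tokens of the organization currently being built
    for token, tag in classified_text:
        if tag == "ORGANIZATION":
            current.append(token.lower())
        elif current:  # the run just ended: flush the collected tokens
            combined_orgs.append(" ".join(current))
            current = []
    if current:  # flush a trailing organization at the end of the text
        combined_orgs.append(" ".join(current))
    return combined_orgs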

First, avoid using StanfordNERTagger; it is going to be deprecated soon. Use the CoreNLPParser instead (it talks to a running CoreNLP server, here at http://localhost:9000), as shown in Stanford Parser and NLTK:

>>> from nltk.parse import CoreNLPParser

# Lexical Parser
>>> parser = CoreNLPParser(url='http://localhost:9000')

>>> ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
>>> list(ner_tagger.tag(('Rami Eid is studying at Stony Brook University in NY'.split())))
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'STATE_OR_PROVINCE')]

Once you have the list of tuples with tokens and NER tags, the task you want to achieve is to group the contiguous token-tag items for a given tag type. You can try the solution from https://stackoverflow.com/a/30666949/610569:

from nltk import pos_tag
from nltk.chunk import conlltags2tree
from nltk.tree import Tree

def stanfordNE2BIO(tagged_sent):
    bio_tagged_sent = []
    prev_tag = "O"
    for token, tag in tagged_sent:
        if tag == "O": #O
            bio_tagged_sent.append((token, tag))
            prev_tag = tag
            continue
        if tag != "O" and prev_tag == "O": # Begin NE
            bio_tagged_sent.append((token, "B-"+tag))
            prev_tag = tag
        elif prev_tag != "O" and prev_tag == tag: # Inside NE
            bio_tagged_sent.append((token, "I-"+tag))
            prev_tag = tag
        elif prev_tag != "O" and prev_tag != tag: # Adjacent NE
            bio_tagged_sent.append((token, "B-"+tag))
            prev_tag = tag

    return bio_tagged_sent


def stanfordNE2tree(ne_tagged_sent):
    bio_tagged_sent = stanfordNE2BIO(ne_tagged_sent)
    sent_tokens, sent_ne_tags = zip(*bio_tagged_sent)
    sent_pos_tags = [pos for token, pos in pos_tag(sent_tokens)]

    sent_conlltags = [(token, pos, ne) for token, pos, ne in zip(sent_tokens, sent_pos_tags, sent_ne_tags)]
    ne_tree = conlltags2tree(sent_conlltags)
    return ne_tree

def extract_ner(ne_tagged_sent):
    ne_tree = stanfordNE2tree(ne_tagged_sent)

    ne_in_sent = []
    for subtree in ne_tree:
        if type(subtree) == Tree: # If subtree is a noun chunk, i.e. NE != "O"
            ne_label = subtree.label()
            ne_string = " ".join([token for token, pos in subtree.leaves()])
            ne_in_sent.append((ne_string, ne_label))
    return ne_in_sent

Then:

ne_tagged_sent = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), 
('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), 
('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), 
('in', 'O'), ('NY', 'LOCATION')]

print(extract_ner(ne_tagged_sent))

[out]:

[('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION'), ('NY', 'LOCATION')]
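
To go from that result back to the lowercase organization list asked for in the question, you could keep only the ORGANIZATION entries and lowercase them. A small sketch, assuming classified_text is the (token, tag) output of the NER tagger for the original sentence and that the tagger labels all three company names as ORGANIZATION:

# classified_text: the (token, tag) pairs produced by the NER tagger
ne_in_sent = extract_ner(classified_text)

# Keep only the organizations and lowercase them to match the desired output.
orgs = [name.lower() for name, label in ne_in_sent if label == "ORGANIZATION"]
print(orgs)
# e.g. ['sprint international corporation', 'nike inc', 'wal-mart company']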
