

Parse text to get the proper nouns (names and organizations) - python nltk

I am trying to extract proper nouns, i.e. person names and organization names, from very short texts such as SMS messages. The basic parsers available with NLTK (see "Finding Proper Nouns using NLTK WordNet") do find the nouns, but the problem arises with proper nouns that do not start with a capital letter. In a text like the one below, a name like sumit is not recognized as a proper noun, while the capitalized Samit is tagged NNP:

>>> from nltk import pos_tag
>>> sentence = "i spoke with sumit and rajesh and Samit about the gridlock situation last night @ around 8 pm last nite"
>>> tagged_sent = pos_tag(sentence.split())
>>> print(tagged_sent)
[('i', 'PRP'), ('spoke', 'VBP'), ('with', 'IN'), ('sumit', 'NN'), ('and', 'CC'), ('rajesh', 'JJ'), ('and', 'CC'), ('Samit', 'NNP'), ('about', 'IN'), ('the', 'DT'), ('gridlock', 'NN'), ('situation', 'NN'), ('last', 'JJ'), ('night', 'NN'), ('@', 'IN'), ('around', 'IN'), ('8', 'CD'), ('pm', 'NN'), ('last', 'JJ'), ('nite', 'NN')]

There is a better way to extract the names of people and organizations: run the POS-tagged tokens through NLTK's named entity chunker.

from nltk import pos_tag, ne_chunk
from nltk.tokenize import SpaceTokenizer
from nltk.tree import Tree

# `sentence` is the string from the question
tokenizer = SpaceTokenizer()
toks = tokenizer.tokenize(sentence)
pos = pos_tag(toks)
chunked_nes = ne_chunk(pos)

# Each named entity is a subtree; join its leaf tokens into one string
nes = [' '.join(word for word, tag in ne.leaves()) for ne in chunked_nes if isinstance(ne, Tree)]

However, all named entity recognizers make mistakes. If you really don't want to miss any proper name, you could keep a dictionary of proper names and check whether each token is contained in it.
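The dictionary check could look roughly like the sketch below. It only illustrates the idea: `KNOWN_NAMES` and `find_known_names` are made-up names, and the three entries are just the names from the question, so a real list would have to come from a much larger resource. The comparison is done in lowercase, so sumit and Samit are both found regardless of capitalization.

```python
# Hypothetical name list for illustration; a real one would be far larger.
KNOWN_NAMES = {"sumit", "rajesh", "samit"}


def find_known_names(text, known_names=KNOWN_NAMES):
    """Return the tokens of `text` whose lowercase form is in `known_names`."""
    found = []
    for token in text.split():
        # Strip surrounding punctuation so that "rajesh," still matches.
        word = token.strip(".,!?;:\"'()")
        if word.lower() in known_names:
            found.append(word)
    return found


sentence = ("i spoke with sumit and rajesh and Samit about the gridlock "
            "situation last night @ around 8 pm last nite")
print(find_known_names(sentence))  # ['sumit', 'rajesh', 'Samit']
```

A set lookup like this misses nothing that is in the list, but it will also flag ordinary words that happen to be names, so it probably works best combined with the chunker output rather than as a replacement for it.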

You might want to have a look at python-nameparser. It also tries to guess the capitalization of names. Sorry for the incomplete answer, but I don't have much experience using python-nameparser.

Best of luck!

Try this code:

import nltk

def get_entities(qry="who is Mahatma Gandhi"):
    tokens = nltk.tokenize.word_tokenize(qry)
    pos = nltk.pos_tag(tokens)
    # binary=False keeps the entity types (PERSON, ORGANIZATION, GPE, ...)
    sentt = nltk.ne_chunk(pos, binary=False)
    print(sentt)
    person = []
    # NLTK 3 exposes the chunk type via label(); older releases used .node
    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
        for leaf in subtree.leaves():
            person.append(leaf)
    print("person=", person)
    return person

You can get the names of persons, organizations and locations with the help of the ne_chunk() function. Hope it helps. Thanks.

 