
NLTK: Extract the entity name from a string

Python and NLTK noob here. Messing around with something.

I have a string that contains text from a PDF document, and I'm trying to extract entity names using the nltk library:

with open(filename, 'r') as f:
    str_output = f.readlines()   

str_output = clean_str(str(str_output))

sentences = nltk.sent_tokenize(str_output)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

I went through the steps of importing the data, cleaning the string, and preprocessing it. How does one go about getting the different entity names from the string?

This should work:

import nltk

with open('sample.txt', 'r') as f:
    sample = f.read()

sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

def extract_entity_names(t):
    entity_names = []

    # NLTK 3 replaced the old t.node attribute with the t.label() method
    if hasattr(t, 'label') and t.label():
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names

entity_names = []
for tree in chunked_sentences:
    # Print results per sentence
    # print(extract_entity_names(tree))

    entity_names.extend(extract_entity_names(tree))

# Print all entity names
# print(entity_names)

# Print unique entity names
print(set(entity_names))
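
To see what the recursive walk does without downloading any NLTK models, here is a minimal sketch that runs on a hand-built tree. The `Node` class is a hypothetical stand-in that only mimics the two things the function relies on from an NLTK tree (a `label()` method and iteration over children), and the sample sentence and its tags are made up for illustration, shaped like what `ne_chunk` returns with `binary=True`.

```python
class Node(list):
    """Hypothetical stand-in for an NLTK tree: a list of children plus a label."""

    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label


def extract_entity_names(t):
    """Collect the joined words of every subtree labelled 'NE'."""
    names = []
    # Leaves are plain (word, tag) tuples, which have no label() method
    if hasattr(t, 'label'):
        if t.label() == 'NE':
            names.append(' '.join(word for word, tag in t))
        else:
            for child in t:
                names.extend(extract_entity_names(child))
    return names


# Made-up sentence tree: 'S' is the sentence root, 'NE' marks a named entity
sentence = Node('S', [
    Node('NE', [('New', 'NNP'), ('York', 'NNP')]),
    ('is', 'VBZ'),
    ('home', 'NN'),
    ('to', 'TO'),
    Node('NE', [('Google', 'NNP')]),
    ('.', '.'),
])

print(extract_entity_names(sentence))  # ['New York', 'Google']
```

Multi-word entities come out as a single string because all the leaves under one `NE` subtree are joined with spaces.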
