
NLTK Named Entity recognition for a column in a dataset

Thanks to alvas's code from "Named Entity Recognition with Regular Expression: NLTK", I have the following example:

from nltk import ne_chunk, pos_tag
from nltk.tokenize import word_tokenize
from nltk.tree import Tree

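def get_continuous_chunks(text):
    # Tokenize, POS-tag, then chunk named entities into subtrees.
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for i in chunked:
        if isinstance(i, Tree):
            # Adjacent entity subtrees are merged into one name.
            current_chunk.append(" ".join(token for token, pos in i.leaves()))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            # Reset even when the entity was already seen.
            current_chunk = []

    # Flush an entity that sits at the very end of the text.
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)

    return continuous_chunk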

txt = 'The new GOP era in Washington got off to a messy start Tuesday as House Republicans,under pressure from President-elect Donald Trump.'
print(get_continuous_chunks(txt))

The output is:

['GOP', 'Washington', 'House Republicans', 'Donald Trump']

I replaced the sample text with a row from my dataset, txt = df['content'][38], and got this result:

['Ina', 'Tori K.', 'Martin Cuilla', 'Phillip K', 'John J Lavorato']

The dataset has many rows and a single column named 'content'. How can I use this code to extract the names from each row of that column and store them in the corresponding row of a new column?

A second example, using StanfordNERTagger:

import os
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
from nltk.tree import Tree
st = StanfordNERTagger(stanford_classifier, stanford_ner_path, encoding='utf-8')
text = df['content']
tokenized_text = word_tokenize(text)
classified_text = st.tag(tokenized_text)
print(classified_text)

Try apply:

df['ne'] = df['content'].apply(get_continuous_chunks)
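
To see what apply produces, here is a minimal sketch on a toy DataFrame. The names extract_names and safe_extract are hypothetical: extract_names is a regex stand-in for get_continuous_chunks so the snippet runs without the NLTK models, and safe_extract is an assumed guard for missing cells, since a tokenizer expects a string.

```python
import re
import pandas as pd

def extract_names(text):
    # Hypothetical stand-in for get_continuous_chunks: returns runs of
    # capitalized words so the example runs without NLTK model downloads.
    return re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", text)

def safe_extract(text):
    # Missing cells (None/NaN) are not strings, so return an empty list.
    if not isinstance(text, str):
        return []
    return extract_names(text)

df = pd.DataFrame({"content": ["please call Martin Cuilla today.",
                               "no names here",
                               None]})

# apply calls the function once per row; each cell of 'ne' holds a list.
df["ne"] = df["content"].apply(safe_extract)
print(df)
```

With the real function, the same guard could wrap get_continuous_chunks in place of extract_names.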

For the code in your second example, create a function and apply it the same way:

def my_st(text):
    tokenized_text = word_tokenize(text)
    return st.tag(tokenized_text)

df['st'] = df['content'].apply(my_st)
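
Note that st.tag returns (token, tag) pairs, not names. If only person names are wanted, one option is to merge consecutive tokens tagged PERSON. The sketch below assumes the classifier emits a PERSON label, as the standard English Stanford models do. The helper person_names is hypothetical, and the sample list is hand-written in the tagger's output format, not real tagger output.

```python
from itertools import groupby

def person_names(tagged_tokens):
    # tagged_tokens: list of (token, tag) pairs, as returned by st.tag().
    names = []
    # groupby yields runs of consecutive tokens that share the same tag.
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag == "PERSON":
            names.append(" ".join(token for token, _ in group))
    return names

# Hand-written sample in the tagger's output format (assumed, for illustration).
sample = [("John", "PERSON"), ("Lavorato", "PERSON"), ("met", "O"),
          ("Ina", "PERSON"), ("in", "O"), ("Houston", "LOCATION")]
print(person_names(sample))  # ['John Lavorato', 'Ina']
```

It can then be applied to the tagged column in the same way, for example df['names'] = df['st'].apply(person_names).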
