
How to find uncapitalised proper nouns with NLTK?

I'm trying to write a 'fix faulty capitalisation' program, and I'm trying to find proper nouns in Python using NLTK's POS tagger. The problem is that it doesn't work very well on text with faulty or missing capitalisation.

This is the code I have so far:

import nltk

text = "This is My text. Unicorns are very Nice, I think. how do you do? are you okay! testing capitalisation. my nice Friend is called bob he lives in america."

tokenized_words = nltk.word_tokenize(text)
pos_tagged_text = nltk.pos_tag(tokenized_words)
print(pos_tagged_text)

And the output is:

[('This', 'DT'), ('is', 'VBZ'), ('My', 'PRP$'), ('text', 'NN'), ('.', '.'), ('Unicorns', 'NNS'), ('are', 'VBP'), ('very', 'RB'), ('Nice', 'NNP'), (',', ','), ('I', 'PRP'), ('think', 'VBP'), ('.', '.'), ('how', 'WRB'), ('do', 'VB'), ('you', 'PRP'), ('do', 'VB'), ('?', '.'), ('are', 'VBP'), ('you', 'PRP'), ('okay', 'JJ'), ('!', '.'), ('testing', 'VBG'), ('capitalisation', 'NN'), ('.', '.'), ('my', 'PRP$'), ('nice', 'JJ'), ('Friend', 'NNP'), ('is', 'VBZ'), ('called', 'VBN'), ('bob', 'NN'), ('he', 'PRP'), ('lives', 'VBZ'), ('in', 'IN'), ('america', 'NN'), ('.', '.')]

As you can see, there are quite a few mistakes. "Nice" and "Friend" are tagged as proper nouns (NNP), while "bob" and "america" are tagged as common nouns (NN).
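To make the mismatch concrete, here is a small sketch that filters tagged tokens for the Penn Treebank proper-noun tags (NNP and NNPS). The `tagged` list is a hand-copied excerpt of the output above, not a fresh run, and `proper_nouns` is just an illustrative helper name.

```python
# Excerpt of the (token, tag) pairs printed above, copied by hand rather than re-run.
tagged = [('Nice', 'NNP'), ('Friend', 'NNP'), ('bob', 'NN'), ('america', 'NN')]

# Penn Treebank tags for singular and plural proper nouns.
PROPER_NOUN_TAGS = {'NNP', 'NNPS'}

def proper_nouns(tagged_tokens):
    """Return the tokens whose tag marks them as proper nouns."""
    return [tok for tok, tag in tagged_tokens if tag in PROPER_NOUN_TAGS]

print(proper_nouns(tagged))  # ['Nice', 'Friend']
```

So the tagger seems to follow the capital letters rather than the words themselves.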

How can I find proper nouns regardless of capitalisation?

I recommend the Python library spaCy; its pretrained models are accurate at part-of-speech tagging. If the casing of the original text isn't reliable, I suggest lower-casing the entire text before tagging, so that stray capitals such as "Nice" and "Friend" are not mistaken for proper nouns.

import spacy

# Requires the large English model: python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')

text = "This is My text. Unicorns are very Nice, I think. how do you do? are you okay! testing capitalisation. my nice Friend is called bob he lives in america."
doc = nlp(text.lower())  # lower-case first so stray capitals cannot mislead the tagger
print([tok for tok in doc if tok.pos_ == 'PROPN'])  # extract all proper nouns

Output:

[bob, america]
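Once you have the proper nouns, you still need to put the capitals back. Below is a minimal sketch of that step, assuming the proper nouns arrive as a plain list of strings (for example `[tok.text for tok in doc if tok.pos_ == 'PROPN']`). The function name `fix_capitalisation` and its rules are my own illustration, not part of spaCy or NLTK: it lower-cases everything, then capitalises sentence starts, the pronoun "I", and the listed proper nouns.

```python
import re

def fix_capitalisation(text, proper_nouns):
    """Lower-case text, then capitalise sentence starts, 'i', and known proper nouns."""
    names = {name.lower() for name in proper_nouns}
    pieces = []
    start_of_sentence = True
    # Alternate between runs of word characters and runs of everything else,
    # so spacing and punctuation are preserved exactly.
    for piece in re.findall(r"\w+|\W+", text.lower()):
        if piece[0].isalnum() or piece[0] == "_":
            if start_of_sentence or piece == "i" or piece in names:
                piece = piece.capitalize()
            start_of_sentence = False
        elif any(ch in ".!?" for ch in piece):
            # Naive rule: any '.', '!' or '?' ends the sentence.
            start_of_sentence = True
        pieces.append(piece)
    return "".join(pieces)

text = "This is My text. Unicorns are very Nice, I think. how do you do? are you okay! testing capitalisation. my nice Friend is called bob he lives in america."
print(fix_capitalisation(text, ["bob", "america"]))
# This is my text. Unicorns are very nice, I think. How do you do? Are you okay! Testing capitalisation. My nice friend is called Bob he lives in America.
```

The sentence splitting here is deliberately naive: abbreviations such as "e.g." will trigger a capital, and multi-word names are only handled if each word is in the list. It also cannot add the missing full stop between "Bob" and "he".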

