I am trying to extract nouns from text using the Python nltk package. It more or less works, but how do I get rid of the non-alphabetic characters at the end of words? Please see the following example.
from nltk.tag import pos_tag
x = "Back, Back: Back"
tagged_sent = pos_tag(x.split())
y = [word for word,pos in tagged_sent if pos == 'NNP']
Then y takes the value
['Back,', 'Back:', 'Back']
What I really want is
['Back', 'Back', 'Back']
Use re.findall(r'\w+', x) instead of x.split().
(This will give you alphanumeric blocks; if you really want just alphabetic characters, [a-zA-Z]+ would be a good start, but that won't deal well with non-English characters even if you specify re.UNICODE; \w does.)
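As a minimal sketch of the difference between the two tokenizing calls (it uses only the standard re module, so the tagging step is left out, and the variable names are illustrative):

```python
import re

x = "Back, Back: Back"

# str.split() only splits on whitespace, so punctuation stays attached.
split_tokens = x.split()

# r'\w+' matches runs of word characters (letters, digits, underscore),
# so trailing punctuation is never part of a token.
regex_tokens = re.findall(r'\w+', x)

print(split_tokens)   # ['Back,', 'Back:', 'Back']
print(regex_tokens)   # ['Back', 'Back', 'Back']
```

The regex token list can then be passed to pos_tag in place of x.split(). Note that \w+ also breaks words at apostrophes and hyphens, so "don't" becomes two tokens, 'don' and 't'.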
Using filter (in Python 3, filter returns an iterator, so the result is joined back into a string):
>>> my_str = "Back, Back: Back"
>>> ["".join(filter(str.isalnum, x)) for x in my_str.split()]
['Back', 'Back', 'Back']
Using itertools.takewhile:
>>> import itertools
>>> my_str = "Back, Back: Back"
>>> ["".join(itertools.takewhile(str.isalnum, x)) for x in my_str.split()]
['Back', 'Back', 'Back']
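The filter and takewhile approaches give the same result on the example, but they behave differently when punctuation appears inside a word. A small sketch, using a made-up input word that is not from the question:

```python
import itertools

word = "O'Neil,"  # illustrative input with punctuation inside the word

# filter keeps every alphanumeric character, wherever it occurs.
filtered = "".join(filter(str.isalnum, word))

# takewhile stops at the first non-alphanumeric character.
prefix = "".join(itertools.takewhile(str.isalnum, word))

print(filtered)  # ONeil
print(prefix)    # O
```

Neither variant strips only the trailing characters, so which one is acceptable depends on whether the input contains such words.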
You may use re.sub(). Change the last line of your code to:
import re
y = [re.sub('[^A-Za-z]+$', '', word) for word,pos in tagged_sent if pos == 'NNP']
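To try this without installing nltk, here is a small sketch in which tagged_sent is a hand-written stand-in for the output of pos_tag(x.split()), and strip_trailing_non_alpha is a hypothetical helper name wrapping the same re.sub() call:

```python
import re

def strip_trailing_non_alpha(word):
    # '[^A-Za-z]+$' matches one or more characters that are not ASCII
    # letters, anchored at the end of the string, so only the tail is removed.
    return re.sub(r'[^A-Za-z]+$', '', word)

# Assumed stand-in for pos_tag(x.split()); the tags are written by hand.
tagged_sent = [('Back,', 'NNP'), ('Back:', 'NNP'), ('Back', 'NNP')]

y = [strip_trailing_non_alpha(word) for word, pos in tagged_sent if pos == 'NNP']
print(y)  # ['Back', 'Back', 'Back']
```

Because the pattern is anchored with $, punctuation inside a word is left alone: "O'Neil!?" becomes "O'Neil". Trailing digits and non-ASCII letters are stripped as well, since the character class only allows A-Z and a-z.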