How to get rid of non-alphabetic characters at the end of words using Python NLTK

Question

I am trying to extract nouns from text using the Python NLTK package. It more or less works, but how do I get rid of the non-alphabetic characters at the end of words? Please see the following example.

from nltk.tag import pos_tag
x = "Back, Back: Back"
tagged_sent = pos_tag(x.split())
y = [word for word, pos in tagged_sent if pos == 'NNP']

Then y takes the value

['Back,', 'Back:', 'Back']

What I really want is

['Back', 'Back', 'Back']

Answer 1

Use

import re
re.findall(r'\w+', x)

instead of

x.split()

(This gives you alphanumeric blocks; note that \w also matches the underscore. If you really want only alphabetic characters, [a-zA-Z] is a good start, but it won't handle non-English letters even if you specify re.UNICODE; \w does.)
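
As a rough sketch of the difference described above, the snippet below compares \w+ with a letters-only variant. The pattern [^\W\d_]+ is an addition that is not part of this answer: it takes whatever \w matches and excludes digits and the underscore, which approximates "alphabetic" for non-English text as well. The sample string is made up for illustration.

```python
import re

# Sample text (made up for illustration); note the digit and the underscore.
x = "Back, Back: Back café2 naïve_x"

# \w+ keeps letters, digits and underscores together in one block.
alnum_blocks = re.findall(r'\w+', x)

# [^\W\d_]+ is \w minus digits and the underscore, i.e. roughly "letters only".
alpha_blocks = re.findall(r'[^\W\d_]+', x)

print(alnum_blocks)
print(alpha_blocks)
```

On Python 3, str patterns are Unicode-aware by default, so re.UNICODE does not need to be passed there.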

Answer 2

Using filter:

>>> my_str = "Back, Back: Back"
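>>> # On Python 3, filter() returns an iterator, so join the characters back into a string.
>>> ["".join(filter(str.isalnum, x)) for x in my_str.split()]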
['Back', 'Back', 'Back']

Using itertools.takewhile:

>>> import itertools
>>> my_str = "Back, Back: Back"
>>> ["".join(itertools.takewhile(str.isalnum, x)) for x in my_str.split()]
['Back', 'Back', 'Back']
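
The two approaches above agree on the example, but they differ once punctuation appears inside a word. Here is a small sketch using a made-up sample string. The str.rstrip variant is an extra option that is not part of this answer; it is included because it removes only trailing ASCII punctuation.

```python
import itertools
import string

# Sample text (made up for illustration) with an apostrophe inside a word.
my_str = "don't, stop: Back"
words = my_str.split()

# filter drops every non-alphanumeric character, wherever it occurs.
filtered = ["".join(filter(str.isalnum, w)) for w in words]

# takewhile keeps characters only up to the first non-alphanumeric one.
prefix_only = ["".join(itertools.takewhile(str.isalnum, w)) for w in words]

# rstrip removes only trailing ASCII punctuation and leaves the rest intact.
trailing_stripped = [w.rstrip(string.punctuation) for w in words]

print(filtered)           # ['dont', 'stop', 'Back']
print(prefix_only)        # ['don', 'stop', 'Back']
print(trailing_stripped)  # ["don't", 'stop', 'Back']
```

If words such as "don't" should survive intact, the trailing-only strip is probably the closest match to what the question asks for.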

Answer 3

You can use re.sub(). Change your last line of code to:

import re
y = [re.sub(r'[^A-Za-z]+$', '', word) for word, pos in tagged_sent if pos == 'NNP']
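
A minimal sketch of this approach that runs without NLTK installed. The tagged_sent list is a hand-written stand-in for the output of pos_tag(x.split()), so the tags are assumed rather than produced by the tagger. The helper strip_trailing_non_letters is a hypothetical Unicode-aware variant that is not part of this answer.

```python
import re

# Hand-written stand-in for pos_tag(x.split()); the tags are assumed, not NLTK output.
tagged_sent = [('Back,', 'NNP'), ('Back:', 'NNP'), ('Back', 'NNP')]

# ASCII-only: strip any run of non-letters at the end of each proper noun.
y = [re.sub(r'[^A-Za-z]+$', '', word) for word, pos in tagged_sent if pos == 'NNP']

def strip_trailing_non_letters(word):
    """Hypothetical variant: strip trailing non-word characters, digits and underscores."""
    return re.sub(r'[\W\d_]+$', '', word)

print(y)                                      # ['Back', 'Back', 'Back']
print(re.sub(r'[^A-Za-z]+$', '', 'Café!?'))   # 'Caf'  (the ASCII class also eats the é)
print(strip_trailing_non_letters('Café!?'))   # 'Café'
```

The ASCII character class is fine for English text; the variant is only worth considering when words may end in accented or non-Latin letters.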

3 answers

solution1
2 ACCEPTED 2016-04-11 05:13:01

solution2
0 2016-04-11 05:22:56

solution3
0 2016-04-11 05:25:33