
How to get rid of the non-alphabetic character at the end of a word using Python NLTK

I am trying to extract nouns from text using the Python NLTK package. It more or less works, but how can I get rid of the non-alphabetic characters at the end of words? Please see the following example.

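from nltk.tag import pos_tag
x = "Back, Back: Back"
# Splitting on whitespace leaves punctuation attached to each token
tagged_sent = pos_tag(x.split())
# Keep only the tokens tagged as proper nouns
y = [word for word, pos in tagged_sent if pos == 'NNP']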

Then y takes the value

['Back,', 'Back:', 'Back']

What I really want is

['Back', 'Back', 'Back']

Use

re.findall(r'\w+', x)

instead of

x.split()

(This gives you alphanumeric blocks, since \w also matches digits and underscores. If you really want only alphabetic characters, [a-zA-Z] is a good start, but it will not handle non-English letters even if you specify re.UNICODE; \w does.)
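A quick sketch of that difference. The second string is a hypothetical sample, not from the question, chosen to include accented letters and a digit:

```python
import re

x = "Back, Back: Back"
sample = "Z\u00fcrich, caf\u00e9: Back2"  # hypothetical input: "Zürich, café: Back2"

# \w+ keeps letters, digits and underscores; for str patterns in Python 3
# it is Unicode-aware by default
words = re.findall(r'\w+', x)
unicode_words = re.findall(r'\w+', sample)

# [a-zA-Z]+ only knows ASCII letters, so accented words get split apart
ascii_words = re.findall(r'[a-zA-Z]+', sample)

print(words)          # ['Back', 'Back', 'Back']
print(unicode_words)  # ['Zürich', 'café', 'Back2']
print(ascii_words)    # ['Z', 'rich', 'caf', 'Back']
```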

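Using filter (in Python 3, filter returns an iterator, so the result has to be joined back into a string):

>>> my_str = "Back, Back: Back"
>>> ["".join(filter(str.isalnum, x)) for x in my_str.split()]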
['Back', 'Back', 'Back']

Using itertools.takewhile:

>>> import itertools
>>> my_str = "Back, Back: Back"
>>> ["".join(itertools.takewhile(str.isalnum, x)) for x in my_str.split()]
['Back', 'Back', 'Back']
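These two approaches agree on the example but differ once punctuation appears inside or in front of a word: filter drops every non-alphanumeric character, while takewhile stops at the first one. A small sketch for Python 3; the helper names and the extra test words are hypothetical, added only for illustration:

```python
import itertools

def keep_alnum(word):
    # Drops every non-alphanumeric character, wherever it occurs
    return "".join(filter(str.isalnum, word))

def leading_alnum(word):
    # Keeps characters only up to the first non-alphanumeric one
    return "".join(itertools.takewhile(str.isalnum, word))

print(keep_alnum("Back,"), leading_alnum("Back,"))  # Back Back
print(keep_alnum("don't"), leading_alnum("don't"))  # dont don
print(repr(keep_alnum("(Back)")))                   # 'Back'
print(repr(leading_alnum("(Back)")))                # ''
```

Neither one strips only trailing characters, so if words like "don't" matter, a pattern anchored at the end of the word may be the safer choice.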

You may use re.sub(). Change your last line of code to

import re
y = [re.sub(r'[^A-Za-z]+$', '', word) for word, pos in tagged_sent if pos == 'NNP']
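A self-contained sketch of this approach. To run without NLTK data, tagged_sent is hard-coded here to mimic the shape of pos_tag output (the tags are assumed, not computed), and the two helper names are hypothetical. The second helper is a possible variant for text with non-English letters:

```python
import re

# Hard-coded stand-in for pos_tag(x.split()); the tags are assumed
tagged_sent = [("Back,", "NNP"), ("Back:", "NNP"), ("Back", "NNP")]

def strip_trailing_non_alpha(word):
    # Remove a run of characters outside A-Z/a-z anchored at the end of the word
    return re.sub(r'[^A-Za-z]+$', '', word)

def strip_trailing_non_letters(word):
    # Variant: strip trailing non-word characters, digits and underscores,
    # which leaves accented letters in place
    return re.sub(r'[\W\d_]+$', '', word)

y = [strip_trailing_non_alpha(word) for word, pos in tagged_sent if pos == 'NNP']
print(y)  # ['Back', 'Back', 'Back']

print(strip_trailing_non_alpha("don't!"))        # don't
print(strip_trailing_non_alpha("caf\u00e9,"))    # caf (the accented e is outside A-Za-z)
print(strip_trailing_non_letters("caf\u00e9,"))  # café
```

Because the pattern is anchored with $, punctuation inside a word is left alone and only the trailing run is removed.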

