
Extracting words using nltk

From the website http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html I've learned how to split tagged words from a tagged corpus.

The code on the website:

>>> sent = '''
... The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
... other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
... Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
... said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
... accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
... interest/NN of/IN both/ABX governments/NNS ''/'' ./.
... '''
>>> [nltk.tag.str2tuple(t) for t in sent.split()]
  [('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'),
  ('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ... ('.', '.')]

Here I get a list of tagged words. What I want is a list containing only the words. For example:

  [('The'), ('grand'), ('jury')...

instead of

  ('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN')...

Any suggestions on how I can obtain this?

Thanks in advance.

I'm not an nltk expert, but you can pick the first element of each tuple directly:

[nltk.tag.str2tuple(t)[0] for t in sent.split()]

That will give you a list of all the words:

['The', 'grand', 'jury'...
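To see the idea without installing nltk, here is a minimal sketch in plain Python. The `split_word` helper is hypothetical (it is not part of nltk); it only approximates the word half of `nltk.tag.str2tuple` by splitting on the last `/`. The sentence is a shortened copy of the one in the question.

```python
# Shortened copy of the tagged sentence from the question.
sent = '''
The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
other/AP topics/NNS ,/, ./.
'''

def split_word(token, sep='/'):
    """Hypothetical helper: return the text before the last separator.

    Tokens without a separator are returned unchanged.
    """
    word, found, _tag = token.rpartition(sep)
    return word if found else token

# Keep only the word part of every whitespace-separated token.
words = [split_word(t) for t in sent.split()]
print(words)
```

With nltk available, the list comprehension in the answer should do the same job, since indexing `[0]` simply drops the tag from each `(word, tag)` pair.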

What you're asking is a little confusing, because in your example output every element is wrapped in what looks like a 1-tuple, and I don't really see the point of that.

Edit: As larsman pointed out, ('The',) would be a 1-tuple, while ('The') == 'The', so the output shown in the question is really just a list of plain strings.
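The difference between a parenthesized string and a 1-tuple can be checked quickly. This is just an illustrative sketch; the variable names are made up for the example.

```python
plain = ('The')      # parentheses alone do nothing: this is a str
single = ('The',)    # the trailing comma is what makes a 1-tuple

print(type(plain).__name__)   # str
print(type(single).__name__)  # tuple
print(plain == 'The')         # True

# If 1-tuples were really wanted, they would have to be built explicitly.
words = ['The', 'grand', 'jury']
one_tuples = [(w,) for w in words]
print(one_tuples)
```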
