
NLTK extracting terms of chunker parse tree

John Edward Grey started running now that he knows he is fat

She was listening to smack that by that awful singer

I want to extract interesting terms from a sentence. I currently use POS tagging to identify the grammatical type of each token. Then I add each token to a counter (with different weights for nouns, verbs and adjectives).

I now wish to use a chunker for this. I think the leaf nodes of the parse tree hold all the interesting words and phrases. How do I extract the terms from the chunker output?

In linguistics, the "interesting words" are called open class words. The task you are describing is not really a chunking/parsing task. You are looking for some sort of tagger/annotator/labeller that marks each word as "interesting" or not.

Sequence Labelling

If you approach your task as a sequence labelling task, then the sentence "John Edward Grey started running now that he knows he is fat" will be tagged as follows:

[('John','B'),('Edward','I'),('Grey','I'),('started','O'),('running','B'),
('now','O'),('that','O'),('he','O'),('knows','O'),('he','O'),
('is','O'),('fat','B')]
  • Anything tagged with B marks the beginning of an "interesting" chunk, and any following words tagged with I continue it.

  • A subsequent word tagged with O ends the "interesting" chunk.

  • A subsequent B also ends the previous "interesting" chunk and starts a new one.
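To make the B/I/O reading above concrete, here is a minimal sketch in plain Python that groups a tagged sentence into chunks. The function name `extract_chunks` is made up for illustration and is not part of NLTK; it assumes the tags are exactly the strings B, I and O.

```python
def extract_chunks(tagged):
    """Group (token, tag) pairs carrying B/I/O tags into chunk strings."""
    chunks, current = [], []
    for token, tag in tagged:
        if tag == "B":
            # A new chunk starts; close the previous one if it is still open.
            if current:
                chunks.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            # Continue the chunk that is currently open.
            current.append(token)
        else:
            # "O" (or a stray "I" with no open chunk) closes any open chunk.
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks


tagged = [('John', 'B'), ('Edward', 'I'), ('Grey', 'I'), ('started', 'O'),
          ('running', 'B'), ('now', 'O'), ('that', 'O'), ('he', 'O'),
          ('knows', 'O'), ('he', 'O'), ('is', 'O'), ('fat', 'B')]

print(extract_chunks(tagged))  # ['John Edward Grey', 'running', 'fat']
```

The resulting strings can then be fed into whatever counter or weighting scheme you already use for single tokens.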

What is interesting or not?

Actually, what is interesting or not depends on the ultimate aim of your task. To me, started running is an "interesting" chunk, because started modifies the infinitive meaning of running to give it a begin-action modality.

Closed class vs Open class words

If you already know which words are not interesting, then I suggest you build a dictionary of them and run a sequence labelling script that picks out the words not in that dictionary of closed class words.
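A rough sketch of that dictionary idea, in plain Python. The word list `CLOSED_CLASS` is a tiny hand-picked sample and the function name `label_open_class` is hypothetical; a real closed class list (pronouns, determiners, prepositions, conjunctions, auxiliaries) would be much longer.

```python
# Assumption: a small illustrative sample, not a complete closed class list.
CLOSED_CLASS = {"now", "that", "he", "she", "is", "was", "to", "by",
                "the", "a", "of", "and"}


def label_open_class(tokens, closed_class=CLOSED_CLASS):
    """Tag tokens with B/I/O, treating every word outside the dictionary
    as part of an "interesting" chunk."""
    tagged = []
    inside = False  # True while we are inside a run of open class words
    for token in tokens:
        if token.lower() in closed_class:
            tagged.append((token, "O"))
            inside = False
        else:
            tagged.append((token, "I" if inside else "B"))
            inside = True
    return tagged


sentence = "John Edward Grey started running now that he knows he is fat"
print(label_open_class(sentence.split()))
```

Note that this simple rule merges adjacent open class words into one chunk, so "John Edward Grey started running" comes out as a single chunk and "knows" is tagged B. That differs from the hand tagging shown earlier, which is why the definition of "interesting" still has to come from your task.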

Machine learning Approach

Another approach is to treat this as a machine learning classification task, where you have pre-annotated sample data marking what is interesting and what is not. You then choose some classification features and train a classifier to tag new data automatically with B, I, O tags.
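As a sketch of the feature step only, the code below turns a pre-annotated sentence into (feature dict, tag) pairs. The feature choices and the names `token_features` and `to_training_pairs` are assumptions for illustration, not a recommended feature set. Lists of (feature dict, label) pairs are the input shape that NLTK classifiers such as `nltk.NaiveBayesClassifier.train` accept.

```python
def token_features(tokens, i):
    """Build a feature dict for the token at position i (illustrative features)."""
    token = tokens[i]
    return {
        "lower": token.lower(),
        "is_capitalized": token[:1].isupper(),
        "suffix3": token[-3:].lower(),
        # Neighbouring words, with sentence boundary markers at the edges.
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }


def to_training_pairs(tagged):
    """Convert [(token, tag), ...] into [(features, tag), ...]."""
    tokens = [token for token, _ in tagged]
    return [(token_features(tokens, i), tag)
            for i, (_, tag) in enumerate(tagged)]


annotated = [('John', 'B'), ('Edward', 'I'), ('Grey', 'I'), ('started', 'O')]
for features, tag in to_training_pairs(annotated):
    print(tag, features)
```

With enough annotated sentences, the pairs from all of them can be concatenated into one training list, and the same `token_features` function is then applied to unseen sentences at prediction time.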

