
Identifying text using NLP

I'm trying to find the courses in the line of text below using an NLP technique.

from nltk import word_tokenize, pos_tag, ne_chunk

sentence = "SDGI is offering courses like Electronics,Mechatronics, Physics,Mechanical Engineering"
# Tokenize, POS-tag, then run NLTK's named-entity chunker
print(ne_chunk(pos_tag(word_tokenize(sentence))))

The output of this is:

(S
  (ORGANIZATION SDGI/NNP)
  is/VBZ
  offering/VBG
  courses/NNS
  like/IN
  Electronics/NNS
  ,/,
  Mechatronics/NNS
  ,/,
  (PERSON Physics/NNPS)
  ,/,
  (PERSON Mechanical/NNP Engineering/NNP))

Is there any way I can extract the courses from this line?

In my real project I will be getting many documents from which I need to extract the course names.

Any help is appreciated!

  1. Extract all the nouns from the given text.
  2. Create a bag-of-words feature set and train a classifier for courses on labeled data.
  3. The courses mostly seem to precede or follow a comma (,). A bigram or trigram approach could give accurate results.
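
A rough sketch of steps 1 and 3 combined, not a full solution: it groups consecutive noun-tagged tokens into phrases and keeps only the phrases that sit directly next to a comma. To stay self-contained it works on (token, tag) pairs copied from the tagger output in the question instead of calling NLTK; the function name `candidate_courses` is made up for illustration, and step 2 (training on labeled data) is not covered here.

```python
# (token, tag) pairs copied from the tagger output shown in the question.
tagged = [
    ("SDGI", "NNP"), ("is", "VBZ"), ("offering", "VBG"), ("courses", "NNS"),
    ("like", "IN"), ("Electronics", "NNS"), (",", ","), ("Mechatronics", "NNS"),
    (",", ","), ("Physics", "NNPS"), (",", ","),
    ("Mechanical", "NNP"), ("Engineering", "NNP"),
]


def candidate_courses(tagged_tokens):
    """Return noun phrases that directly precede or follow a comma."""
    runs = []      # (start index, end index exclusive, phrase)
    start = None   # start index of the noun run currently being read
    # A sentinel with a non-noun tag closes a run that ends the sentence.
    for i, (_, tag) in enumerate(tagged_tokens + [("", "END")]):
        if tag.startswith("NN"):       # NN, NNS, NNP, NNPS
            if start is None:
                start = i
        elif start is not None:
            phrase = " ".join(tok for tok, _ in tagged_tokens[start:i])
            runs.append((start, i, phrase))
            start = None

    def is_comma(idx):
        return 0 <= idx < len(tagged_tokens) and tagged_tokens[idx][1] == ","

    # Keep only the runs with a comma immediately before or after them.
    return [p for s, e, p in runs if is_comma(s - 1) or is_comma(e)]


print(candidate_courses(tagged))
```

On this sentence the comma rule drops "SDGI" and "courses", which are nouns but not part of the list. It would miss a sentence that names a single course with no commas, so treat it as a heuristic only.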

This might be too simplistic, but if there is a finite number of existing course names, it might be easier just to create a large lookup table, tokenize your input, and try to look each word up. There will be some edge cases, but I'm not sure you need to take an ML/NLP approach to this problem.
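
A minimal sketch of the lookup idea, assuming you can build the list of course names up front. The `KNOWN_COURSES` set and the `find_courses` function are invented for illustration. Because some names span several words ("Mechanical Engineering"), it tries the longest window of words first rather than looking up single words only.

```python
import re

# Hypothetical lookup table; in practice this would hold every known course name.
KNOWN_COURSES = {"electronics", "mechatronics", "physics", "mechanical engineering"}
# Longest course name, measured in words, bounds the window size.
MAX_WORDS = max(len(name.split()) for name in KNOWN_COURSES)


def find_courses(text):
    """Return known course names found in text, preferring the longest match."""
    words = re.findall(r"[A-Za-z]+", text)  # crude tokenizer: alphabetic runs only
    found = []
    i = 0
    while i < len(words):
        # Try the widest window first so multi-word names win over single words.
        for n in range(min(MAX_WORDS, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate.lower() in KNOWN_COURSES:
                found.append(candidate)
                i += n
                break
        else:
            i += 1  # no course name starts at this word
    return found


sentence = "SDGI is offering courses like Electronics,Mechatronics, Physics,Mechanical Engineering"
print(find_courses(sentence))
```

One of the edge cases: the tokenizer throws punctuation away, so a multi-word name could match across a comma. Splitting the text on commas before the lookup would avoid that.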
