
How to extract tokens from a list of strings where it's hard to find patterns

I am building a model from a resume database, and I want to extract just the degree name from each candidate's resume. My initial approach was to find a pattern and extract the match with a regex, but since there was no apparent pattern, my second approach was to use NLP and see whether any label matches my desired string. I also looked for an API or Python library that lists all possible degree names, but with no success. The following are some of the strings:

'bachelor of Computer Science Engineering University : Anna Un'
'master of Information Technology University : Deakin Univer'
'diploma in Management 2016 M.Sc. of Computer Science (“Diplo']
'master of Analytics Concentration: Data handling and manage'
'master of Engineering (Software) University of Melbourne 20'
'bachelor of B USINESS INFOR MATIO N SY STEM S – Monash Univer'

However, I have already extracted the first two words and standardized them to masters, bachelors, and diploma, in case this helps, since they appear in different formats such as "masters in", "masters of", etc. Below is a snapshot of the data to give some idea. Thanks.

I have done this using the spaCy library. There are two ways to do it, both covered in the spaCy documentation:

  1. Rule-based (pattern-based) matching
  2. Custom NER training for your specific use case

You can choose either of the above.
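To give a feel for the rule-based option, here is a minimal sketch in plain Python using the standard `re` module rather than spaCy itself. It assumes the degree level has already been standardized to a "bachelor of" / "master of" / "diploma in" prefix, as in the question, and it cuts the field of study at the first "stop marker". The marker list (`University`, `Concentration`, a four-digit year, a spaced dash, a colon) and the function name `extract_degree` are assumptions made up from the six sample strings, not a complete rule set, and they would need tuning on real data.

```python
import re

# Standardized degree-level prefix, e.g. "bachelor of", "masters in", "diploma in".
PREFIX = re.compile(r"^\s*(bachelor|master|diploma)s?\s+(?:of|in)\s+", re.IGNORECASE)

# Hypothetical stop markers, guessed from the sample strings: anything that
# signals the field of study has ended (institution words, a year, " – ", ":").
STOP = re.compile(
    r"\b(?:University|Concentration|College|Institute)\b"
    r"|\b(?:19|20)\d{2}\b"
    r"|\s[–-]\s"
    r"|:",
    re.IGNORECASE,
)


def extract_degree(text):
    """Return (level, field) for a resume snippet, or None if no prefix matches."""
    prefix = PREFIX.match(text)
    if prefix is None:
        return None
    level = prefix.group(1).lower()
    rest = text[prefix.end():]
    stop = STOP.search(rest)
    # Keep everything up to the first stop marker; keep it all if none is found.
    field = rest[:stop.start()] if stop else rest
    return level, field.strip()


samples = [
    "bachelor of Computer Science Engineering University : Anna Un",
    "diploma in Management 2016 M.Sc. of Computer Science",
    "master of Engineering (Software) University of Melbourne 20",
]
for s in samples:
    print(extract_degree(s))
```

Rules like these break down on strings with no recognizable marker, and they do not repair text with garbled spacing such as 'B USINESS INFOR MATIO N SY STEM S'. That is where the second option, a custom NER model trained on labelled examples, is likely to generalize better.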
