
How to extract tokens from a list of strings where the patterns are hard to find

I am building a model from a resume database, and I want to extract just the degree name from each candidate's resume. My first approach was to find a pattern and extract the match with a regex, but no pattern was apparent. My second approach was to use NLP and see whether any label matches the string I want. I also looked for an API or Python library that lists all possible degree names, without success. The following are some of the strings:

'bachelor of Computer Science Engineering University : Anna Un'
'master of Information Technology University : Deakin Univer'
'diploma in Management 2016 M.Sc. of Computer Science (“Diplo']
'master of Analytics Concentration: Data handling and manage'
'master of Engineering (Software) University of Melbourne 20'
'bachelor of B USINESS INFOR MATIO N SY STEM S – Monash Univer'

In case it helps, I have already extracted the first two words and standardized them to masters, bachelors, and diploma, since they appear in different formats such as "masters in", "masters of", etc. Below is a snapshot of the data to give some idea. Thanks.

[image: snapshot of the data]

I have done this using the spaCy library. There are two ways to do it; see the spaCy documentation for both:

  1. Rule-based (pattern-based) matching.
  2. Custom NER training for your specific use case.

You can choose either of the above.
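
To illustrate the rule-based option, here is a minimal sketch in plain Python using only the standard library `re` module, not spaCy's own matcher. It assumes the strings already start with a standardized prefix such as "master of" or "diploma in", as described in the question. The boundary rules (`STOP_WORDS`, the year pattern, the dash and colon separators) and the function name `extract_degree` are hypothetical choices made for this example; they would need tuning against the real data.

```python
import re

# Hypothetical boundary rules: tokens that usually mark the end of the degree name.
STOP_WORDS = {"university", "concentration", "college", "institute"}
SEPARATORS = {"–", "-", ":"}
YEAR = re.compile(r"^(19|20)\d{2}$")

# Assumes the first two words were already standardized, as in the question.
PREFIX = re.compile(r"^(bachelor|master|diploma)\s+(of|in)\s+", re.IGNORECASE)


def extract_degree(text):
    """Return (level, field), or None if the string lacks a standardized prefix."""
    match = PREFIX.match(text)
    if match is None:
        return None
    level = match.group(1).lower()
    field_tokens = []
    for token in text[match.end():].split():
        bare = token.strip(":,.;").lower()
        # Stop at a boundary word, a four-digit year, or a separator token.
        if bare in STOP_WORDS or YEAR.match(bare) or token in SEPARATORS:
            break
        field_tokens.append(token)
    return level, " ".join(field_tokens)


samples = [
    "bachelor of Computer Science Engineering University : Anna Un",
    "master of Analytics Concentration: Data handling and manage",
    "master of Engineering (Software) University of Melbourne 20",
]
for sample in samples:
    print(extract_degree(sample))
```

Rules like these do not repair broken spacing such as "B USINESS INFOR MATIO N SY STEM S"; that string would come back with the spacing intact and need a separate cleanup step. Strings that follow no recognizable boundary are where the custom NER option is more likely to pay off.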


 