Named Entity Recognition in NLP using Python

I have a lot of CVs as text documents. They contain dates in several different formats, e.g. Birthdate - 12-12-1995, Experience-year - 2000 PRESENT or 1995-2005 or 5 years of experience or 1995/2005, Date-of-Joining - 5th March, 2015, etc. From this data I want to extract only the years of experience. How can I do this in Python using NLP?

I have tried the following:

# This prints every date that datefinder finds in the document
import datefinder

# Read the whole CV into one string; the with-block closes the file afterwards
with open("/home/system/Desktop/samplecv/5c22fcad79fcc1.33753024.txt") as f:
    text = f.read()

matches = datefinder.find_dates(text)
for match in matches:
    print(match)

If you have already extracted the dates, then what you seem to be missing is the type of each date. If datefinder can't keep track of where each date sits within the corpus, then extracting dates with it won't be very useful.
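To make the point about position concrete, here is a minimal sketch that records where each year-like token occurs and what text comes just before it. It is a plain-regex stand-in, not datefinder: the `YEAR` pattern, the `find_years_with_context` helper and the `window` size are all assumptions made up for illustration.

```python
import re

# Assumed pattern (not from the question): a four-digit year from 1900 to 2099.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def find_years_with_context(text, window=30):
    """Return (year, start_offset, preceding_text) for each year-like token."""
    hits = []
    for m in YEAR.finditer(text):
        start = m.start()
        # Keep up to `window` characters before the match as context.
        context = text[max(0, start - window):start]
        hits.append((int(m.group()), start, context))
    return hits


sample = "Birthdate - 12-12-1995\nExperience-year - 2000 PRESENT"
for year, offset, context in find_years_with_context(sample):
    print(year, offset, repr(context))
```

With the offset and the preceding text in hand, each date can later be matched against a label such as "Birthdate" or "Experience" instead of being treated as an anonymous date.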

However, this isn't just an entity recognition problem. You'll have to pair an NER with a POS tagger (and maybe even a syntactic dependency parser). spaCy is a good choice.

You should first run a POS tagger on your corpus and see whether it picks up phrases like "Experience" or "Work History". If not, you should add your own labels so that it tags those words the way you want.

Then you can run an NER to pick up dates. Keep in mind that the NER will at best tag all your dates as DATE entities; it won't be able to distinguish what type of date each one is.

You'll have to link each date to a preceding or following part of speech using some language grammar or a regular expression.

For instance, you can associate all dates that follow the word "Experience" with the Experience POS tag.
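As a rough illustration of the regular-expression route, the sketch below only looks at lines that mention "experience" and pulls year ranges or explicit durations out of them. It uses the standard library only and stands in for the POS/NER pipeline rather than implementing it. The patterns, the `experience_spans` helper and the `current_year` parameter are assumptions chosen to fit the formats listed in the question, and real CVs will likely need more cases.

```python
import re
from datetime import date

YEAR = r"(?:19|20)\d{2}"
# Start year, an optional separator, then an end year or the word "present".
RANGE = re.compile(rf"\b({YEAR})\s*(?:-|/|to)?\s*({YEAR}|present)\b", re.IGNORECASE)
# An explicit duration such as "5 years" or "5+ years".
DURATION = re.compile(r"\b(\d+(?:\.\d+)?)\s*\+?\s*years?\b", re.IGNORECASE)


def experience_spans(text, current_year=None):
    """Return (matched_text, years) pairs from lines that mention "experience"."""
    if current_year is None:
        current_year = date.today().year
    spans = []
    for line in text.splitlines():
        if "experience" not in line.lower():
            continue  # skip birthdates, joining dates and other unrelated lines
        for m in RANGE.finditer(line):
            start = int(m.group(1))
            end_text = m.group(2)
            end = current_year if end_text.lower() == "present" else int(end_text)
            spans.append((m.group(0), float(end - start)))
        for m in DURATION.finditer(line):
            spans.append((m.group(0), float(m.group(1))))
    return spans


sample = """Birthdate - 12-12-1995
Experience-year - 2000 PRESENT
Experience - 1995-2005
5 years of experience
Date-of-Joining - 5th March, 2015"""

for matched, years in experience_spans(sample, current_year=2024):
    print(matched, years)
```

Filtering by line is the simplest possible form of "linking a date to a nearby word". If a CV puts the "Experience" heading on one line and the dates on the following lines, the same idea would need to track the most recent heading instead.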

Alternatively, you can try NLTK (an alternative to spaCy, though you'll need to run the same pipeline with it too). Read here for more.

 