
Perform Named Entity Recognition - NLP

I am trying to learn how to perform Named Entity Recognition.

I have a set of discharge summaries containing medical information about patients. I converted the unstructured data into structured data, and now I have a DataFrame that looks like this:

Text                        |   Target
normal coronary arteries...     R060

The Text column contains information about a patient's diagnosis, and the Target column contains the code that will need to be predicted in a later task.

I have also constructed a dictionary that looks like this:

Code (Key) | Term (Value)
A00          Cholera

This dictionary lists each diagnosis together with its corresponding code. The Term column will be used to identify the clinical entities in the corpus.

I will need to train a classifier that predicts the code, in order to automate the process of assigning codes to the discharge summaries (I mention this to give an idea of the overall task).

So far I have converted my data into a structured form. I am trying to understand how I should perform Named Entity Recognition to label the medical terminology. I would like to try direct matching and fuzzy matching, but I am not sure which steps should come first. Should I tokenize, stem, and lemmatize beforehand? Or should I first find the medical terminology, since clinical named entities are often multi-token terms with nested structures that contain other named entities? Also, which Python packages or tools would you recommend?

I am new to this field, so any help will be appreciated! Thanks!

If you are asking about building a classification model, then you should consider deep learning. Deep learning models tend to perform well on text classification.

For this kind of language processing task, I recommend that you first tokenize your text and pad the resulting sequences to a common length. Basic tokenization should be enough, but you can add more preprocessing, such as basic string cleanup, because proper preprocessing can improve your model's accuracy by as much as 3% or 4%. For basic string processing, you can use regular expressions (the built-in re module) in Python.

https://docs.python.org/3/library/re.html
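As a rough illustration of the cleanup, tokenization, and padding steps described above, here is a minimal sketch that uses only the standard library. The helper names (clean, tokenize, build_vocab, encode), the reserved ids, and the second sample sentence are assumptions made for this example; they are not part of the original question.

```python
import re

PAD_ID = 0  # id reserved for padding positions (assumption for this sketch)
UNK_ID = 1  # id reserved for tokens not seen when building the vocabulary


def clean(text):
    """Lowercase, turn non-alphanumeric characters into spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Basic whitespace tokenization of the cleaned text."""
    return clean(text).split()


def build_vocab(texts):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {"<pad>": PAD_ID, "<unk>": UNK_ID}
    for text in texts:
        for token in tokenize(text):
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode(text, vocab, max_len):
    """Convert text to ids, truncate to max_len, then right-pad with PAD_ID."""
    ids = [vocab.get(token, UNK_ID) for token in tokenize(text)]
    ids = ids[:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


# Illustrative inputs only; the second sentence is made up for the example.
texts = ["Normal coronary arteries.", "Chest pain, normal ECG"]
vocab = build_vocab(texts)
encoded = [encode(text, vocab, max_len=5) for text in texts]
```

Every row of encoded has the same length, which is what most neural network libraries expect as input. Stemming or lemmatization could be added inside tokenize if it turns out to help on your data.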

I think you are doing the mapping after preprocessing. Mapping should be enough for tasks like classification, but I recommend that you learn about word embeddings, which can improve your model.
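If "mapping" here means looking up the dictionary terms in the text, as the question describes, direct and fuzzy matching can be sketched with the standard library's difflib. This is only one possible reading. The find_terms helper, the 0.85 cutoff, the "X1" entry, and the sample sentence are assumptions made for illustration; only the A00/Cholera pair comes from the question.

```python
import difflib
import re

# "A00": "Cholera" is from the question; "X1" is a made-up entry for illustration.
code_to_term = {"A00": "Cholera", "X1": "Chest pain"}


def find_terms(text, code_to_term, cutoff=0.85):
    """Return (matched_text, code, match_type) tuples for dictionary terms in text.

    Candidates are token n-grams up to the length of the longest term, so
    multi-token terms can match. The cutoff is an arbitrary similarity
    threshold chosen for this sketch and would need tuning on real data.
    """
    term_to_code = {term.lower(): code for code, term in code_to_term.items()}
    terms = list(term_to_code)
    max_len = max(len(term.split()) for term in terms)
    tokens = re.findall(r"[a-z0-9]+", text.lower())

    matches = []
    for n in range(max_len, 0, -1):  # try longer candidates first
        for i in range(len(tokens) - n + 1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in term_to_code:  # direct (exact) match
                matches.append((candidate, term_to_code[candidate], "direct"))
                continue
            # Fuzzy match: closest dictionary term above the similarity cutoff.
            close = difflib.get_close_matches(candidate, terms, n=1, cutoff=cutoff)
            if close:
                matches.append((candidate, term_to_code[close[0]], "fuzzy"))
    return matches


found = find_terms("Patient with suspected cholra and chest pain.", code_to_term)
```

The sketch reports every match it finds, including shorter terms nested inside longer ones, so overlapping results would still need to be resolved (for example by keeping the longest match). Comparing each n-gram against the whole dictionary is also slow for a large term list; a dedicated fuzzy-matching library may be a better fit there.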

For all these tasks, I recommend TensorFlow. TensorFlow is a well-known tool for machine learning, language processing, image processing, and much more. You can learn natural language processing from the official TensorFlow documentation, which provides learning material in its tutorials section.

https://www.tensorflow.org/tutorials/

I think this will help you. All the best with your work!

Thank you.


 