How to do sequence labeling with an unlabeled dataset

I have 1000 text files, each containing the discharge summary for one patient.

SAMPLE_1

The patient was admitted on 21/02/99. he appeared to have pneumonia at the time of admission so we empirically covered him for community-acquired pneumonia with ceftriaxone and azithromycin until day 2 when his blood cultures grew out strep pneumoniae that was pan sensitive so we stopped the ceftriaxone and completed a 5 day course of azithromycin. But on day 4 he developed diarrhea so we added flagyl to cover for c.diff, which did come back positive on day 6 so he needs 3 more days of that…” this can be summarized more concisely as follows: “Completed 5 day course of azithromycin for pan sensitive strep pneumoniae pneumonia complicated by c.diff colitis. Currently on day 7/10 of flagyl and c.diff negative on 9/21.

SAMPLE_2

The patient is an 56-year-old female with history of previous stroke; hypertension; COPD, stable; renal carcinoma; presenting after a fall and possible syncope. While walking, she accidentally fell to her knees and did hit her head on the ground, near her left eye. Her fall was not observed, but the patient does not profess any loss of consciousness, recalling the entire event. The patient does have a history of previous falls, one of which resulted in a hip fracture. She has had physical therapy and recovered completely from that. Initial examination showed bruising around the left eye, normal lung examination, normal heart examination, normal neurologic function with a baseline decreased mobility of her left arm. The patient was admitted for evaluation of her fall and to rule out syncope and possible stroke with her positive histories.

I also have a CSV file with 1000 rows and 5 columns. Each row holds information that was entered manually for one of the text files. For example, for the two files above, someone has manually entered these records in the CSV file:

Sex, Primary Disease,Age, Date of admission,Other complications
M,Pneumonia, NA, 21/02/99, Diarhhea
F,(Hypertension,stroke), 56, NA, NA

My questions are:

  1. How do I represent these text:label pairs for a machine learning algorithm?

  2. Do I need to do some manual labelling around the areas of interest in all 1000 text files?

If yes, then how, and which method should I use? (e.g. <ADMISSION> was admitted on 21/02/99</ADMISSION> , <AGE>56-year-old</AGE> )

So basically, how do I use this text:label data to automate the filling in of labels?

As far as I can tell, the point is not to mark up the texts but to extract the information represented by the annotations. This is an information extraction problem, and you should read up on techniques for this. The CSV file contains the information you want to extract (your "gold standard"), so you should start by splitting it into training (90%) and testing (10%) subsets.
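The 90/10 split could be sketched as follows. This is only an illustration: the file names and the `split_gold_standard` helper are hypothetical, and the fixed seed is an assumption made so that the split is reproducible.

```python
import random

def split_gold_standard(rows, test_fraction=0.1, seed=0):
    """Shuffle the gold-standard rows and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed: same split every run
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical stand-in for the 1000 (text file, CSV row) pairs.
rows = [f"summary_{i:04d}.txt" for i in range(1000)]
train, test = split_gold_standard(rows)
print(len(train), len(test))
```

Keep the test subset untouched until the end, so that the reported accuracy reflects records the system has never seen.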

There is a named entity recognition task in there: recognize diseases, numbers, dates and gender. You could use an off-the-shelf chunker, or find an annotated medical corpus and use it to train one. You can also use a mix of approaches; spotting words that reveal gender is something you could hand-code pretty easily, for example. Once you have all these words, you need some more work: for example, to distinguish the primary disease from the symptoms, the age from other numbers, and the date of admission from any other dates. This is probably best done as a separate classification task.
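As a rough illustration of the hand-coded part, the sketch below spots gender words, ages and dates with regular expressions. The patterns and the `spot_candidates` helper are assumptions that happen to fit the two samples in the question; real discharge summaries would need broader patterns, and the dates found are only candidates that still have to be classified as admission date or not.

```python
import re

# Assumed patterns; tuned only to the two sample summaries in the question.
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")
AGE_RE = re.compile(r"\b(\d{1,3})-year-old\b", re.IGNORECASE)
FEMALE_RE = re.compile(r"\b(she|her|female|woman)\b", re.IGNORECASE)
MALE_RE = re.compile(r"\b(he|him|his|male|man)\b", re.IGNORECASE)

def spot_candidates(text):
    """Return a guessed sex, the first stated age, and all date candidates."""
    female = len(FEMALE_RE.findall(text))
    male = len(MALE_RE.findall(text))
    if female > male:
        sex = "F"
    elif male > female:
        sex = "M"
    else:
        sex = "NA"  # no evidence, or a tie
    age = AGE_RE.search(text)
    return {
        "sex": sex,
        "age": int(age.group(1)) if age else None,
        "dates": DATE_RE.findall(text),
    }

sample_1 = ("The patient was admitted on 21/02/99. he appeared to have "
            "pneumonia, so we covered him until his blood cultures grew out.")
sample_2 = ("The patient is an 56-year-old female with history of previous "
            "stroke. While walking, she accidentally fell and hit her head.")
print(spot_candidates(sample_1))
print(spot_candidates(sample_2))
```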

I recommend you now read through the NLTK book, chapter by chapter, so that you have some idea of what the available techniques and tools are. It's the approach that matters, so don't get bogged down in comparisons of specific machine learning engines.

I'm afraid the algorithm that fills the gaps has not yet been invented. If the gaps were strongly correlated or had some sort of causality, you might be able to model that with some sort of Bayesian model. Still, with the amount of data you have, this is pretty much impossible.

Now on the more practical side of things. You can take two approaches:

  1. Treat the problem as a document-level task, in which case you can just take all rows with a label, train on them, and infer the labels/values of the rest. You should look at Naïve Bayes, multi-class SVM, MaxEnt, etc. for the categorical columns and linear regression for predicting the numerical values.
  2. Treat the problem as an information extraction task, in which case you have to add the annotation you mentioned inside the text and train a sequence model. You should look at CRF, structured SVM, HMM, etc. Actually, you could look at some systems that adapt multiclass classifiers to sequence labeling tasks, e.g. SVMTool for POS tagging (it can be adapted to most sequence labeling tasks).
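To make approach 1 concrete, here is a minimal hand-rolled multinomial Naive Bayes classifier for a categorical column such as the primary disease. It is a sketch, not a recommendation over an existing library: the `NaiveBayes` class, the tokenizer and the four training snippets are invented for illustration, and with real data you would train on the full texts paired with the CSV column.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (an assumption)."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # class prior
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented toy data standing in for (summary text, "Primary Disease") pairs.
texts = [
    "covered for community-acquired pneumonia with ceftriaxone and azithromycin",
    "blood cultures grew strep pneumoniae, completed azithromycin",
    "history of hypertension and previous stroke, presenting after a fall",
    "admitted to rule out syncope and stroke, blood pressure elevated",
]
labels = ["Pneumonia", "Pneumonia", "Hypertension", "Hypertension"]
model = NaiveBayes().fit(texts, labels)
print(model.predict("started azithromycin for pneumonia"))
print(model.predict("fall with possible stroke"))
```

A classifier like this only predicts values it has seen as labels, which is why it suits the categorical columns and not free-form values such as dates.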

Now about the problems you will face. With approach 1, it is very unlikely that you will predict the date of the record with any algorithm. It might be possible to roughly predict the patient's age, as this is something that usually correlates with diseases, etc. And it's very, very unlikely that you will be able to even set up the disease column as an entity extraction task.

If I had to solve your problem, I would probably pick approach 2, which is IMHO the correct approach but is also quite a bit of work. In that case, you will need to create the markup annotations yourself. A good starting point is an annotation tool called brat. Once you have your annotations, you could develop a classifier in the style of CoNLL-2003.
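Sequence models are usually trained on one token per line with a tag, similar in spirit to the CoNLL-2003 data layout. The sketch below converts the inline markup style from the question into BIO tags. It rests on several assumptions: the annotations are inline XML-like tags (brat itself stores annotations in separate standoff files, so its output would need a different reader), tags are not nested, tokens are split on whitespace, and `to_bio` is a hypothetical helper name.

```python
import re

# Matches <LABEL>some text</LABEL>; nested tags are not supported.
TAG_RE = re.compile(r"<(?P<label>[A-Z_]+)>(?P<span>.*?)</(?P=label)>", re.DOTALL)

def to_bio(annotated):
    """Turn inline-annotated text into a list of (token, BIO tag) pairs."""
    rows = []
    pos = 0
    for match in TAG_RE.finditer(annotated):
        # Text before the annotation is outside any entity.
        rows += [(tok, "O") for tok in annotated[pos:match.start()].split()]
        for i, tok in enumerate(match.group("span").split()):
            prefix = "B" if i == 0 else "I"  # B = first token, I = the rest
            rows.append((tok, f"{prefix}-{match.group('label')}"))
        pos = match.end()
    rows += [(tok, "O") for tok in annotated[pos:].split()]
    return rows

example = "The patient <ADMISSION>was admitted on 21/02/99</ADMISSION> with pneumonia"
for token, tag in to_bio(example):
    print(f"{token}\t{tag}")
```

Each (token, tag) row can then be extended with features such as capitalization or neighbouring words before being handed to a CRF or a similar sequence model.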

What you are trying to achieve seems like quite a lot, especially with only 1000 records. I think (depending on your data) you may be better off using ready-made products instead of building them yourself. There are open source and commercial products that you might be able to use: lexigram.io has an API, and MetaMap and Apache cTAKES are state-of-the-art open source tools for clinical entity extraction.
