
Classify sentences using controlled vocabularies with Python

I have several medical vocabularies (medications, symptoms, signs, diseases) and some free-text diagnostic reports. I want to use TF-IDF or machine learning techniques to first split the free text into sentences and then classify the important sentences into categories. I am working in Python. For example, "patients need to take aspirin" should be classified as "medication use", because "aspirin" can be found in the medication vocabulary. Can you recommend some algorithms? Thank you :)

Since you already have the lists of keywords, I would suggest using CountVectorizer. It has a vocabulary parameter, and you can pass your keyword list to it. CountVectorizer will then look only for those keywords in each document and build a feature vector from them. Let's look at an example:

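from sklearn.feature_extraction.text import CountVectorizer

# Controlled vocabulary: only these terms become features.
keywords = ["aspirin", "medication", "patients"]
sen1 = "patients need to take aspirin"
sen2 = "medication required immediately"
corpus = [sen1, sen2]

# With a fixed vocabulary no fit() is needed; transform() counts each keyword per sentence.
vectorizer = CountVectorizer(vocabulary=keywords)
X = vectorizer.transform(corpus)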

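After this, print the vectorizer's feature names with print(vectorizer.get_feature_names_out()) (on scikit-learn releases older than 1.0, use get_feature_names() instead). You will see the keywords in the order you supplied them: ['aspirin' 'medication' 'patients']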

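To see the vector for each sentence, call print(X.toarray()), which shows the following matrix: [[1 0 1] [0 1 0]]. Each row is a sentence and each column is a keyword, in the order above. The values are occurrence counts; because no keyword repeats within a sentence here, they read as presence (1) and absence (0). If you want strict presence/absence even when a keyword repeats, pass binary=True to CountVectorizer.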
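To get from these keyword vectors to a category per sentence, one option is to keep a separate keyword list for each category and sum the counts per category. The following is only a minimal sketch of that idea, not a tested clinical pipeline. The vocabularies dictionary, its terms, the category names, and the classify helper are all made up for illustration. It assumes single-word, lowercase terms and that no term appears in more than one category (CountVectorizer rejects duplicate vocabulary entries).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical vocabularies; replace with the real controlled vocabularies.
vocabularies = {
    "medication use": ["aspirin", "ibuprofen", "medication"],
    "symptom": ["headache", "fever", "nausea"],
}

# Flatten into one term list and remember the category index of each term.
categories = list(vocabularies)
terms, term_category = [], []
for cat_index, category in enumerate(categories):
    for word in vocabularies[category]:
        terms.append(word)
        term_category.append(cat_index)

vectorizer = CountVectorizer(vocabulary=terms)


def classify(sentences):
    """Return the best-matching category per sentence, or None if no keyword matches."""
    counts = vectorizer.transform(sentences).toarray()
    # Sum the keyword counts that belong to each category.
    scores = np.zeros((len(sentences), len(categories)), dtype=int)
    for term_index, cat_index in enumerate(term_category):
        scores[:, cat_index] += counts[:, term_index]
    return [
        categories[row.argmax()] if row.max() > 0 else None
        for row in scores
    ]


sentences = [
    "patients need to take aspirin",
    "patient reports fever and headache",
    "follow up in two weeks",
]
print(classify(sentences))
```

Sentences with no keyword hit are returned as None rather than forced into a category. Ties go to whichever category comes first in the dictionary, which is an arbitrary choice; the per-category counts could instead be fed as features to a trained classifier if labelled sentences are available.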

