
Word/Phrase classification

I have a column containing 5,000 string records. These records are individual words or phrases (not sentences or paragraphs). Many of them are similar or contain similar elements (e.g. "Office", "offise", "ground floor office"). Also, someone manually classified 300 of these records into five categories (i.e. Residential, Industrial, Office, Retail, Other), which means I can use them to develop a supervised machine learning model. I did a bit of study on word2vec, but it seems to work on texts, not individual words and phrases. Please advise me on how I can do the classification. Please note that the number of records in the column is growing and new records will be added in the future, so the solution must be able to classify new records.

The sample input and the desired output are as below:

'industrial' -> 'Industrial'
'Warehouse' -> 'Industrial'
'Workshop' -> 'Industrial'
'rear warehouse' -> 'Industrial'
'office suite' -> 'Office'
'office/warehouse' -> 'Office'
'office(b1)' -> 'Office'
'house' -> 'Residential'
'suite' -> 'Residential'
'restaurant' -> 'Retail'
'retail unit with 3 bedroom dwelling above' -> 'Retail'
'shoe shop' -> 'Retail'
'unit 56' -> 'Other'
'24 Hastings street' -> 'Other'


You have a very typical text classification task.

There are many classification algorithms you could use, but the main areas for choice/improvement in your task are likely to be:

  • feature-extraction & feature-engineering: how do you turn those short texts into numerical data against which rules/thresholds can be learned?
  • overall process issues: for whatever "tough cases" exist that can't be learned from existing data, either initially or over time, how are the necessary corrections fed back into an improved system?

Initially, you should try 'bag of words' and 'character n-grams' (either alone or together) as ways to turn your short texts into feature vectors. That alone, with sufficient training data, should handle most of the kinds of cases you've shown so far, since it will help any classification algorithm discover certain 'slam-dunk' rules.

For example, it will effectively learn that 'shop' almost always implies 'Retail', 'house' implies 'Residential', and 'office' implies 'Office'. Using character n-grams will also give the model clues as to how to handle typos or other variant forms of the same words.
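As a minimal sketch of that baseline (assuming scikit-learn; the tiny texts/labels lists below are placeholders standing in for your 300 labelled records):

    # Baseline: word bag-of-words plus character n-grams feeding a linear classifier.
    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    texts = ['office suite', 'offise', 'rear warehouse', 'house', 'shoe shop']
    labels = ['Office', 'Office', 'Industrial', 'Residential', 'Retail']

    model = Pipeline([
        ('features', FeatureUnion([
            # whole-word features: can learn rules like 'shop' -> Retail
            ('words', TfidfVectorizer(analyzer='word', ngram_range=(1, 2))),
            # character n-grams within word boundaries: robust to typos like 'offise'
            ('chars', TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4))),
        ])),
        ('clf', LogisticRegression(max_iter=1000)),
    ])
    model.fit(texts, labels)
    # 'offise' shares many character n-grams with 'office', despite the typo
    print(model.predict(['ground floor offise']))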

There will be cases it can't handle well. I'd guess that you'd want '3 bedroom dwelling', alone, to be 'Residential', but in your examples you've binned 'retail unit with 3 bedroom dwelling above' as 'Retail'. With enough examples of the desired behavior, a classifier might get that right, because it either learns that 'Retail' is a category with more precedence, or that other words (like 'above') imply a mixed-use case that should usually be binned one way or the other.

When you look at the cases that approach doesn't handle well, you'll then be able to consider more advanced approaches, like using word-vectors to represent words that weren't necessarily in your (small) training set but could be considered near-synonyms of known words. (For example, one possible policy for handling words unknown to your training set that arrive later would be to use some external, larger word2vec model to replace any unknown word with the closest known word.)
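Here's a rough sketch of that unknown-word policy, assuming gensim and one of its downloadable pretrained word-vector models (the known_words set is a placeholder for the vocabulary of your labelled data):

    import gensim.downloader as api

    # load a large pretrained word-vector model (downloads on first use)
    wv = api.load('glove-wiki-gigaword-100')

    # placeholder: the words actually seen in your labelled training records
    known_words = {'office', 'warehouse', 'house', 'shop', 'restaurant', 'suite'}

    def map_unknown(word):
        """Replace a word the classifier has never seen with its nearest known neighbour."""
        if word in known_words or word not in wv.key_to_index:
            return word
        for neighbour, _score in wv.most_similar(word, topn=20):
            if neighbour in known_words:
                return neighbour
        return word  # no known neighbour found; leave the word as-is

    print(map_unknown('bungalow'))  # plausibly maps to 'house'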

But you should really start with the simplest feature approaches, see how far those get you, and thus set a baseline for later improvements. Then consider more advanced and custom techniques.

This is a classic example of classification using ML where the features are built using NLP. There are multiple steps involved in the process.

  1. Feature engineering: decide whether you want single words or phrases (n-grams of 1, 2, ..., n words) as features. Use CountVectorizer or TfidfVectorizer from sklearn (the latter applies tf-idf weighting); both support n-grams, and you can cap the number of features with max_features.
  2. Stop-word removal: drop high-frequency function words (use the nltk corpus).
  3. Stemming or lemmatisation: reduce words to a common base form (use nltk).
  4. Using supervised learning, build a classification model from the 300 pre-classified records (use a 70/30 train/test split). You can go with Naive Bayes (Multinomial Naive Bayes is the variant usually recommended for text), Random Forest, or neural networks, depending on how much accuracy you want to achieve.
  5. Finally, apply this model to the new records as they arrive; a sketch of the whole pipeline follows this list.
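A condensed sketch of steps 1 to 5 (stemming omitted for brevity; sklearn's built-in English stop-word list stands in for the nltk corpus, the tiny texts/labels lists are placeholders for the 300 labelled records, and Multinomial Naive Bayes is used as the text-appropriate Naive Bayes variant):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import accuracy_score

    # placeholder data standing in for the 300 manually classified records
    texts = ['office suite', 'rear warehouse', 'house', 'shoe shop',
             'workshop', 'restaurant', 'unit 56', 'ground floor office']
    labels = ['Office', 'Industrial', 'Residential', 'Retail',
              'Industrial', 'Retail', 'Other', 'Office']

    # steps 1-2: tf-idf over unigrams and bigrams, stop words removed, features capped
    vec = TfidfVectorizer(ngram_range=(1, 2), stop_words='english', max_features=5000)
    X = vec.fit_transform(texts)

    # step 4: 70/30 train/test split, then Multinomial Naive Bayes
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.3, random_state=42)
    clf = MultinomialNB()
    clf.fit(X_train, y_train)
    print('test accuracy:', accuracy_score(y_test, clf.predict(X_test)))

    # step 5: classify new records with the same fitted vectoriser
    print(clf.predict(vec.transform(['rear office suite'])))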

PS: The trick here is to identify and remove the right words in Step 2 (like 'the', 'is') so that the model does not get biased.
