
How to use bag of words or tf-idf to classify text

I have a general question regarding classifying using bag of words or similar methods.

I have text that I am trying to classify. The classes are known to me, and I know that each sentence of the text belongs to one type of sentence. For example, sentence 1 should be an order, sentence 2 should be news, etc.

So what I was thinking is to use n-gram generation for feature extraction; my idea is that word n-grams can help the machine find the right category. But implementing the idea in Python is not easy for me. I cannot connect the concepts with the implementation. For example, I am not sure whether I have to supply all possible chunks of POS tags that can belong to each category, or whether the machine can find them. Also, I feel that n-grams can be helpful in this kind of analysis, but I don't know how.

It would be great if you could give me some ideas or tell me the steps I should take to do this kind of classification.

Best

To use ngrams in this type of analysis, you can extract all the ngrams that appear in the text. Then, you can calculate TF-IDF for each ngram in each sentence in the following way:

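  • TF: the number of times an ngram appears in the sentence.
  • IDF: the inverse of the proportion of sentences that include that ngram, usually log-scaled: log(N / n), where N is the total number of sentences and n is the number of sentences that contain the ngram.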

Multiplying TF by IDF gives a TF-IDF score that measures the 'value of each ngram to each sentence given all sentences'. Once you have the TF-IDF scores, you can represent each sentence as a vector of these scores and feed the vectors, together with the known class labels, into a standard supervised method.
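As a rough illustration of these steps, here is a minimal sketch using only the Python standard library. The toy sentences, the labels, and all function names (`ngrams`, `fit_idf`, `tfidf`, `train`, `predict`) are made up for this example, and the nearest-centroid classifier is just one simple stand-in for "a standard supervised method"; in practice you would more likely use a library classifier.

```python
import math
from collections import Counter, defaultdict

def ngrams(sentence, n_max=2):
    """Return all word n-grams of length 1..n_max as tuples."""
    tokens = sentence.lower().split()
    return [tuple(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def fit_idf(sentences, n_max=2):
    """IDF per n-gram: log(N / number of sentences containing it)."""
    df = Counter()
    for s in sentences:
        df.update(set(ngrams(s, n_max)))  # count each n-gram once per sentence
    total = len(sentences)
    return {g: math.log(total / c) for g, c in df.items()}

def tfidf(sentence, idf, n_max=2):
    """Sparse TF-IDF vector; n-grams unseen in training are ignored."""
    tf = Counter(ngrams(sentence, n_max))
    return {g: c * idf[g] for g, c in tf.items() if g in idf}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train(sentences, labels, n_max=2):
    """Sum the TF-IDF vectors of each class into one centroid per class."""
    idf = fit_idf(sentences, n_max)
    centroids = defaultdict(Counter)
    for s, y in zip(sentences, labels):
        centroids[y].update(tfidf(s, idf, n_max))
    return idf, dict(centroids)

def predict(sentence, idf, centroids, n_max=2):
    """Pick the class whose centroid is most similar to the sentence."""
    vec = tfidf(sentence, idf, n_max)
    return max(centroids, key=lambda y: cosine(vec, centroids[y]))

# Hypothetical toy training data: two classes, two sentences each.
train_sentences = [
    "please send me the report",
    "please close the door",
    "the government announced a new budget",
    "the company announced record profits",
]
train_labels = ["order", "order", "news", "news"]

idf, centroids = train(train_sentences, train_labels)
print(predict("please send the invoice", idf, centroids))         # order
print(predict("the company announced a merger", idf, centroids))  # news
```

Note that "the" appears in every training sentence, so its IDF is log(4/4) = 0 and it contributes nothing to the classification. You do not have to list the useful n-grams per class yourself; the weighting and the classifier work that out from the labeled examples.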

For each class, you can also build language models based on your ngrams, POS tags, and even dependency-parsed sentences. Then, given a new sentence, you can calculate the likelihood that the sentence was generated by each of the language models. You can then use these probability values as features in a supervised learning method.
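The language-model idea can be sketched in a similar way. The example below assumes word bigram models with add-one (Laplace) smoothing, which is only one of many possible choices; the class `BigramLM`, the helpers `train_models` and `classify`, and the toy data are hypothetical names introduced for illustration. It uses only the standard library.

```python
import math
from collections import Counter, defaultdict

BOS, EOS, UNK = "<s>", "</s>", "<unk>"

def tokenize(sentence):
    """Lowercase, split on whitespace, add sentence boundary markers."""
    return [BOS] + sentence.lower().split() + [EOS]

class BigramLM:
    """Bigram language model with add-one smoothing."""

    def __init__(self, sentences):
        self.context = Counter()  # how often each token starts a bigram
        self.bigrams = Counter()
        for s in sentences:
            tokens = tokenize(s)
            self.context.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))

    def log_prob(self, sentence, vocab_size):
        """Smoothed log-likelihood of the sentence under this model."""
        tokens = tokenize(sentence)
        return sum(
            math.log((self.bigrams[(a, b)] + 1)
                     / (self.context[a] + vocab_size))
            for a, b in zip(tokens, tokens[1:]))

def train_models(sentences, labels):
    """Build one model per class, plus a vocabulary size shared by all."""
    by_class = defaultdict(list)
    vocab = {EOS, UNK}  # UNK reserves probability mass for unseen words
    for s, y in zip(sentences, labels):
        by_class[y].append(s)
        vocab.update(s.lower().split())
    models = {y: BigramLM(ss) for y, ss in by_class.items()}
    return models, len(vocab)

def classify(sentence, models, vocab_size):
    """Pick the class whose model gives the highest log-likelihood.

    This assumes equal class priors; the per-class scores could also be
    passed on as features to another supervised classifier.
    """
    return max(models, key=lambda y: models[y].log_prob(sentence, vocab_size))

# Hypothetical toy training data.
train_sentences = [
    "please send me the report",
    "please close the door",
    "the government announced a new budget",
    "the company announced record profits",
]
train_labels = ["order", "order", "news", "news"]

models, vocab_size = train_models(train_sentences, train_labels)
print(classify("please open the door", models, vocab_size))                    # order
print(classify("the government announced record profits", models, vocab_size))  # news
```

The same structure would work with POS tags instead of words: replace the tokens with their tags before counting bigrams.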

I suggest you check out the following articles:

1 - Look at Section 5.1 here for the use of TF-IDF

2 - This document provides an example for the use of language models

Good luck ;)
