
N-gram order selection for feature engineering

I am working on feature engineering for text classification and I am stuck on choosing features. Most of the literature says to tokenize the text, remove stop words and punctuation, and use the tokens as features, but then you miss multi-word terms like "lung cancer" or longer phrases. So the question is: how do I decide the n-gram order and treat n-grams as features?
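To make the question concrete, here is a minimal sketch (not from the original post) of combining several n-gram orders into one feature space, using only the Python standard library. The example sentence and the choice of orders 1 through 3 are illustrative assumptions:

```python
from collections import Counter
import re

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Illustrative sentence; in practice, iterate over your training documents.
text = "I know someone who has lung cancer. Lung cancer is a terrible disease."
tokens = re.findall(r"[a-z]+", text.lower())

# Combine unigram, bigram, and trigram counts into one feature counter.
features = Counter()
for n in (1, 2, 3):
    features.update(ngrams(tokens, n))
```

With a vectorizer such as scikit-learn's `CountVectorizer`, the equivalent knob is `ngram_range=(1, 3)`; the manual version above just makes the mechanics visible.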

The relevant 2-gram (in this case "lung cancer") will stand out by frequency.
Imagine the following text:

I know someone who has Lung cancer: Lung cancer is terrible disease.

2-grams vs frequency

If you make a list of the 2-grams, you'll find "lung cancer" first, and other combinations ("has lung", "hates lung") after it. This is because certain groups of words denote something, and are therefore repeated, while others are just connectors ("has", "hates") that form 2-grams circumstantially. The key is to filter by frequency.
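The filtering step can be sketched with the answer's own example sentence, using only the Python standard library. The threshold of 2 occurrences is an illustrative choice, not a recommendation from the answer:

```python
from collections import Counter
import re

text = "I know someone who has Lung cancer: Lung cancer is terrible disease."
tokens = re.findall(r"[a-z]+", text.lower())

# Count every 2-gram in the text.
bigrams = Counter(zip(tokens, tokens[1:]))

# Keep only 2-grams seen at least twice; "circumstantial" ones drop out.
frequent = {bg: count for bg, count in bigrams.items() if count >= 2}
```

On a real corpus you would tune the threshold (or use a measure like pointwise mutual information) rather than a fixed cutoff, since a single sentence exaggerates how cleanly the frequencies separate.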

If you are having trouble generating n-grams at all, you may be using the wrong libraries or toolset.

I would say this depends heavily on your training data. You can visualise the frequency distributions of bigrams and trigrams; this may give you an idea of which n-gram orders are relevant. You might also look at noun chunks during your investigation: relevant noun chunks (or parts of them) may appear often, which can give you a sense of how to select your n-grams.
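As a rough illustration of that investigation, the sketch below compares how many n-grams repeat at each order over a toy corpus. The corpus is my own assumption, and the noun-chunk step (e.g. via spaCy's `doc.noun_chunks`) is omitted to keep the example standard-library only:

```python
from collections import Counter
import re

def ngram_counts(tokens, n):
    """Frequency of each n-gram in one token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy corpus; substitute your own training documents.
corpus = [
    "lung cancer is a terrible disease",
    "early detection of lung cancer saves lives",
    "lung cancer treatment depends on the stage",
]

stats = {}
for n in (1, 2, 3):
    counts = Counter()
    for doc in corpus:  # count per document, so n-grams never span documents
        counts.update(ngram_counts(re.findall(r"[a-z]+", doc.lower()), n))
    stats[n] = counts
    repeated = sum(1 for c in counts.values() if c > 1)
    print(f"{n}-grams: {len(counts)} distinct, {repeated} repeated")
```

If almost no trigrams repeat while some bigrams do (as here, where only "lung cancer" recurs), that is a hint that order 2 is where the useful multi-word features live for this data.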

