
Machine Learning: Good way to represent word features

Not quite sure if this is the right place, but here is my question. For features that are numeric in nature, it is quite natural to represent them, plot them, etc., but what about words?

How do you deal with data where you have words as features? Let's say I have a dataset with the following features:

InventoryVal, Number of Units, Avg Price, Category of Event and so on..
  • InventoryVal is a number
  • Number of Units is a number
  • Avg Price is a number
  • Category of Event is a word that is assigned by humans.

Even if I replace a category (for example, "books") with an id (say 1), that id is still something I have assigned, not something intrinsic to the data.

What is a good metric to represent that a product belongs to the category "art" without artificially assigning anything? Or is this question too vague or loosely worded?

As you might have guessed, there are entire ML libraries directed at this problem, but if you just want to get started, the simplest (and perhaps most common) approach is word frequency. In other words, you represent each word as a feature whose value is a function of the number of times that word occurs in each document.
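As a minimal sketch of the word-frequency idea, using only the Python standard library: the function names `word_counts` and `count_vectors`, the simple regex tokenizer, and the two toy documents are all assumptions made for illustration, not part of any particular library.

```python
import re
from collections import Counter


def word_counts(text):
    """Return a Counter mapping each lowercased word to its number of occurrences."""
    # Assumption: a "word" is a run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


def count_vectors(documents):
    """Turn documents into count vectors over one shared, sorted vocabulary."""
    counts = [word_counts(doc) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for words a document does not contain.
    vectors = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, vectors


# Toy documents, made up for this example.
docs = ["the cat sat on the mat", "the dog chased the cat"]
vocabulary, vectors = count_vectors(docs)
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes a row of numbers, one column per word, which is a form most learning algorithms can consume directly.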

But the most common words (a, and, the, this, etc.) are the most frequently occurring in ordinary text documents (e.g., email messages) yet are hardly the most important, so it is common to weight a word feature by the inverse of its frequency, so that words which appear everywhere count for less.
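One common way to do this kind of inverse weighting is TF-IDF (term frequency times inverse document frequency). The answer does not name a specific formula, so treat the variant below as an assumption: the helper name `tf_idf`, the plain `log(N / df)` form of the inverse document frequency, and the toy documents are all illustrative choices.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Toy documents, already split into words; made up for this example.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the art book sold well".split(),
]
weights = tf_idf(docs)
print(weights[2])
```

With this formula, a word such as "the" that occurs in every document gets weight 0, while a word that occurs in only one document, such as "art", gets a comparatively large weight.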

Again, this is the simplest methodology (bag of words is how it's usually referred to); more sophisticated analyses (which are not always required) pre-process the individual words, for example by tagging each one with its part of speech.

If you like Python, I recommend NLTK (Natural Language Toolkit), a mature and well-documented Python library. There are quite a few "getting started" tutorials, but perhaps begin with the ones created by the NLTK contributors and referenced on the NLTK homepage; these tutorials usually rely on corpora (data sets) distributed with NLTK.

If you are using an existing machine learning package, or a packaged machine learning algorithm, there may be a way to tell it that a particular field holds, e.g., integers that are to be treated as identifiers, for which only comparisons for equality and inequality make sense. If not, and there are only a small number of distinct categories, it might make sense to replace a category field that takes 10 values with 10 binary fields, each holding 1 if the object is in that particular category and 0 if not (or with 9 fields, where the object is in the 10th category if all of them are 0).
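The binary-fields idea is usually called one-hot (or dummy) encoding. Here is a minimal sketch in plain Python; the function name `one_hot`, the `drop_last` flag, and the sample category values are assumptions for illustration only.

```python
def one_hot(values, drop_last=False):
    """Encode categorical values as rows of 0/1 indicator fields.

    With drop_last=True the last category (in sorted order) gets no field
    of its own and is represented by a row of all zeros.
    """
    categories = sorted(set(values))
    columns = categories[:-1] if drop_last else categories
    rows = [
        [1 if value == category else 0 for category in columns]
        for value in values
    ]
    return columns, rows


# Hypothetical "Category of Event" values.
events = ["books", "art", "music", "art"]

columns, rows = one_hot(events)
print(columns)
print(rows)

reduced_columns, reduced_rows = one_hot(events, drop_last=True)
print(reduced_columns)
print(reduced_rows)
```

Unlike replacing "books" with the id 1, this encoding does not impose an order or a distance between categories: each one simply gets its own field.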


 