
Machine Learning: Good way to represent word features

Not quite sure if this is the right place or not, but here is my question. For features which are numeric in nature, it is quite natural to represent them, plot them, etc., but what about words?

How do you deal with data where you have words as features? Let's say I have a dataset with the following features:

InventoryVal, Number of Units, Avg Price, Category of Event, and so on:
  • InventoryVal is a number
  • Number of Units is a number
  • Avg Price is a number
  • Category of Event is a word that is assigned by humans.

Even if I replace a category (for example, "books") with an id (say 1), that is still something I have assigned and not something intrinsic to the data.

What is a good metric to represent that a product belongs to the category "art" without artificially assigning anything? Eghh... too vague or loosely worded a question?

As you might have guessed, there are entire ML libraries directed at this problem, but if you just want to get started, the simplest (and perhaps most common) approach is word frequency. In other words, you represent each word as a feature whose value is a function of the number of times that word occurs in each document.
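For concreteness, here is a minimal bag-of-words sketch in plain Python; the example documents and the naive whitespace tokenization are illustrative assumptions, not a fixed recipe.

```python
from collections import Counter

# Hypothetical example documents; in practice these come from your dataset.
documents = [
    "art print of a landscape painting",
    "used art history book in good condition",
    "hardcover book about modern art",
]

def bag_of_words(text):
    """Map a document to {word: occurrence count} using naive whitespace tokenization."""
    return Counter(text.lower().split())

for doc in documents:
    print(bag_of_words(doc))
```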

But the most common words (a, and, the, this, etc.) are the most frequently occurring in ordinary text documents (e.g., email messages), yet they are hardly the most important, so it is common to weight a word feature by the inverse of its frequency.
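One widely used implementation of that inverse-frequency weighting is TF-IDF. Below is a small sketch using scikit-learn's TfidfVectorizer, assuming a reasonably recent scikit-learn is available; the toy documents are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "the art of the deal",
    "a book about the history of art",
    "this is a book and this is another book",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents)   # sparse (n_docs, n_terms) matrix

print(vectorizer.get_feature_names_out())     # the learned vocabulary
print(tfidf.toarray().round(2))               # words common to many documents get lower weights
```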

Again, this is the simplest methodology (usually referred to as bag of words); more sophisticated analyses (which are not always required) pre-process the individual words, e.g., categorizing them by part of speech.

If you like Python, I recommend NLTK (Natural Language Tool Kit), a mature and well-documented Python library. There are quite a few "getting started" tutorials, but perhaps begin with the ones created by the NLTK contributors and referenced on the NLTK homepage; these tutorials usually rely on the corpora (data sets) included in the base NLTK install.
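As a quick illustration of that kind of pre-processing, the sketch below tokenizes a sentence with NLTK and tags parts of speech. It assumes NLTK is installed and that the needed resources have been downloaded; the resource names here are the classic NLTK data packages and may differ slightly between NLTK versions.

```python
import nltk

# One-time setup: download tokenizer models and POS tagger data.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "This art book covers modern painting techniques."
tokens = nltk.word_tokenize(sentence)   # ['This', 'art', 'book', ...]
tagged = nltk.pos_tag(tokens)           # [('This', 'DT'), ('art', 'NN'), ...]
print(tagged)
```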

If you are using an existing machine learning package, or a packaged machine learning algorithm, there may be a way to tell it that a particular field holds, e.g., integers that are to be treated as identifiers, for which only comparisons for equality and inequality make sense. If not, and there are only a small number of distinct categories, it might make sense to replace a category field with 10 values by 10 binary fields, each holding 1 if the object is in that particular category and 0 if not (or 9 fields, with the object being in the 10th category if all of them are 0).
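As a concrete (hypothetical) sketch of that binary-fields idea, one-hot encoding with pandas looks like this; the column names merely echo the dataset described in the question.

```python
import pandas as pd

# Toy data with a human-assigned category column.
df = pd.DataFrame({
    "InventoryVal": [1200.0, 830.5, 400.0],
    "NumberOfUnits": [10, 4, 25],
    "Category": ["books", "art", "books"],
})

# Each distinct category becomes its own 0/1 column (Category_art, Category_books, ...).
encoded = pd.get_dummies(df, columns=["Category"], prefix="Category")
print(encoded)
```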
