
How can I vectorize a list of words?

I am working on SMS data where one column of my dataframe holds a list of words. I want to train a classifier to predict each message's type and subtype. How can I convert the words into a numerical format, given that they are stored in a list?

[Image: dataset]

The idea is to use as the vocabulary all the words found in this column across instances, except that the least frequent words should be removed (to avoid overfitting). Then, for every instance, the column is represented as a vector of boolean features, where the nth value corresponds to the nth word in the vocabulary: 1 if that word is in the list for this instance, 0 if not.
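To make the encoding concrete, here is a minimal pure-Python sketch of that idea. The sample token lists, the helper names build_vocabulary and to_boolean_vector, and the cutoff min_doc_freq=2 are all assumptions made for illustration; they are not taken from the question's data.

```python
from collections import Counter

# Hypothetical token lists, one per SMS (stand-ins for the dataframe column).
token_lists = [
    ["your", "otp", "is", "1234"],
    ["otp", "for", "your", "login"],
    ["your", "account", "is", "credited"],
    ["win", "a", "free", "prize"],
]

def build_vocabulary(lists, min_doc_freq=2):
    """Keep words that occur in at least min_doc_freq instances, sorted alphabetically."""
    doc_freq = Counter()
    for tokens in lists:
        # set() so a word is counted once per instance, not once per occurrence
        doc_freq.update(set(tokens))
    return sorted(word for word, df in doc_freq.items() if df >= min_doc_freq)

def to_boolean_vector(tokens, vocabulary):
    """nth value is 1 if the nth vocabulary word is in this instance's list, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

vocabulary = build_vocabulary(token_lists)
vectors = [to_boolean_vector(tokens, vocabulary) for tokens in token_lists]

print(vocabulary)
for row in vectors:
    print(row)
```

The frequency cutoff is a tuning choice: a higher value gives a smaller vocabulary and fewer features, at the cost of discarding rarer words that might still be informative.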

In Python you can use scikit-learn's CountVectorizer, treating every list in the column as a sentence.
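One possible way to do this is sketched below, assuming scikit-learn is installed. Because the column already holds tokenized lists, the sketch passes a callable as the analyzer so the vectorizer skips its own tokenization; binary=True yields 0/1 features and min_df drops words that appear in too few instances. The sample data, the identity_analyzer helper, and min_df=2 are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical token lists, one per SMS (stand-ins for the dataframe column,
# e.g. something like df["words"].tolist() in the real data).
token_lists = [
    ["your", "otp", "is", "1234"],
    ["otp", "for", "your", "login"],
    ["your", "account", "is", "credited"],
    ["win", "a", "free", "prize"],
]

def identity_analyzer(tokens):
    """Return the list unchanged: each element is already one token."""
    return tokens

vectorizer = CountVectorizer(
    analyzer=identity_analyzer,  # callable analyzer: no built-in tokenizing or lowercasing
    binary=True,                 # presence/absence (1/0) instead of counts
    min_df=2,                    # assumed cutoff: keep words found in at least 2 instances
)

X = vectorizer.fit_transform(token_lists)  # sparse matrix, one row per instance

# Vocabulary words ordered by their column index in X.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(vocabulary)
print(X.toarray())
```

The resulting matrix X can be passed directly to most scikit-learn classifiers. An alternative with the same effect is to join each list into a space-separated string and use the default analyzer, although that applies the vectorizer's own tokenization and lowercasing rules.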

