
How can I vectorize a list of words?

Original post: 2022-06-11 06:06:55 | Tags: text-classification, countvectorizer

I am working on SMS data where one column of my dataframe holds a list of words for each row. I want to train a classifier to predict each message's type and subtype. How can I convert the words into a numerical format, given that they are stored in a list?

1 answer

The idea is to use as the vocabulary all the words found in this column across instances, except that the least frequent words should be removed (to avoid overfitting). Then, for every instance, the column is represented as a vector of boolean features, where the nth value corresponds to the nth word in the vocabulary: 1 if that word is in the list for this instance, 0 if not.
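To make the encoding concrete, here is a minimal sketch in plain Python. The sample rows, the helper names `build_vocabulary` and `to_boolean_vector`, and the `min_count=2` threshold are all assumptions chosen for illustration; they do not come from the original question.

```python
from collections import Counter

def build_vocabulary(token_lists, min_count=2):
    """Keep words that occur in at least `min_count` rows, in sorted order."""
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each word at most once per row
    return sorted(word for word, count in doc_freq.items() if count >= min_count)

def to_boolean_vector(tokens, vocabulary):
    """Return 1 for each vocabulary word present in `tokens`, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

# Hypothetical SMS rows, each already split into a list of words.
rows = [
    ["free", "prize", "call", "now"],
    ["call", "me", "later"],
    ["free", "offer", "call"],
]

vocabulary = build_vocabulary(rows, min_count=2)
vectors = [to_boolean_vector(tokens, vocabulary) for tokens in rows]

print(vocabulary)  # ['call', 'free']
print(vectors)     # [[1, 1], [1, 0], [1, 1]]
```

Words seen in only one row are dropped here; on real data the threshold would need tuning rather than being fixed at 2.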

In Python you can use scikit-learn's CountVectorizer, treating every list in the column as a sentence.
