How can I vectorize a list of words?
The idea is to build the vocabulary from all the words found in this column across instances, dropping the least frequent words (to avoid overfitting). Then, for every instance, the column is represented as a vector of boolean features, where the nth value corresponds to the nth word in the vocabulary: 1 if that word is in the list for this instance, 0 if not.
In Python you can use scikit-learn's CountVectorizer, treating every list in the column as a sentence.
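As a minimal sketch of that approach (the sample data and the threshold of 2 are made up for illustration, and it assumes the words contain no spaces), each list is joined into a space-separated string and passed to CountVectorizer. Setting binary=True gives 0/1 features instead of counts, and min_df drops words that occur in too few instances:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical data: one list of words per instance.
word_lists = [
    ["red", "green", "blue"],
    ["red", "blue"],
    ["red", "yellow"],
    ["green", "red"],
]

# Treat each list as a "sentence" by joining its words with spaces.
sentences = [" ".join(words) for words in word_lists]

vectorizer = CountVectorizer(
    binary=True,           # 1 if the word is present, 0 otherwise
    min_df=2,              # assumed cutoff: drop words seen in fewer than 2 instances
    token_pattern=r"\S+",  # split on whitespace only, keeping one-letter words
)
X = vectorizer.fit_transform(sentences)  # sparse matrix, one row per instance

# Vocabulary words ordered by their column index in X.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
features = X.toarray().tolist()

print(vocabulary)
print(features)
```

Here "yellow" occurs in only one instance, so it is left out of the vocabulary. Note that CountVectorizer lowercases its input by default, and a suitable min_df value depends on the size of the dataset.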