

Sklearn other inputs in addition to text for text classification

I am trying to build a text classifier using scikit-learn's bag-of-words vectorization fed into a classifier. However, I was wondering how I would add another variable to the input apart from the text itself. Say I want to add the number of words in the text as a feature in addition to the text itself (because I think it may affect the result). How should I go about doing so?
Do I have to add another classifier on top of that one? Or is there a way to add that input to the vectorized text?

Scikit-learn classifiers work with numpy arrays. This means that after vectorizing the text, you can add your new features to this array (I take "easily" back, not very easily but doable). The problem is that in text categorization your features are sparse, so normal numpy column additions do not work; you have to stack the new column onto the sparse matrix with scipy.sparse.hstack instead.

The code below is modified from the text mining example in the scikit-learn scipy 2013 tutorial.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
import numpy as np
import scipy.sparse

# `categories` was undefined in the original snippet; the four-category
# subset used in the scipy 2013 tutorial is assumed here.
categories = ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

# Load the text data
twenty_train_subset = load_files('datasets/20news-bydate-train/',
    categories=categories, encoding='latin-1')

# Turn the text documents into tf-idf weighted vectors of word frequencies
vectorizer = TfidfVectorizer(min_df=2)
X_train_only_text_features = vectorizer.fit_transform(twenty_train_subset.data)

print(type(X_train_only_text_features))
print("X_train_only_text_features", X_train_only_text_features.shape)

size = X_train_only_text_features.shape[0]
print("size", size)

# Dense column of ones standing in for the extra feature
ones_column = np.ones(size).reshape(size, 1)
print("ones_column", ones_column.shape)

# Convert the dense column to a sparse matrix so it can be stacked
new_column = scipy.sparse.csr_matrix(ones_column)
print(type(new_column))
print("new_column", new_column.shape)

# hstack glues the extra column onto the sparse tf-idf matrix
X_train = scipy.sparse.hstack([new_column, X_train_only_text_features])

print("X_train", X_train.shape)

The output is the following:

<class 'scipy.sparse.csr.csr_matrix'>
X_train_only_text_features (2034, 17566)
size 2034
ones_column (2034L, 1L)
<class 'scipy.sparse.csr.csr_matrix'>
new_column (2034, 1)
X_train (2034, 17567)
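
As a follow-up, here is a minimal sketch (not part of the original answer) of how the dummy ones column could be replaced with the feature the question actually asks about, a per-document word count, and how the combined matrix could then be fed to the MultinomialNB classifier imported above. The whitespace-based word count and the variable names (word_counts, X_train_combined, clf) are assumptions for illustration; any per-document numeric feature could be stacked the same way.

# Hypothetical follow-up: replace the dummy ones column with a real
# per-document feature (here, a simple whitespace-token word count).
word_counts = np.array([len(doc.split()) for doc in twenty_train_subset.data],
                       dtype=float).reshape(-1, 1)
word_count_column = scipy.sparse.csr_matrix(word_counts)

# Stack the extra column next to the sparse tf-idf features.
X_train_combined = scipy.sparse.hstack([word_count_column,
                                        X_train_only_text_features]).tocsr()

# The combined matrix can be used like any other feature matrix.
clf = MultinomialNB()
clf.fit(X_train_combined, twenty_train_subset.target)

Note that MultinomialNB expects non-negative features, which both tf-idf values and word counts satisfy; for classifiers sensitive to feature scale, the extra column would typically be scaled before stacking.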
