Sklearn other inputs in addition to text for text classification
I am trying to build a text classifier using scikit-learn's bag-of-words vectorization feeding into a classifier. However, I was wondering how I would add another variable to the input apart from the text itself. Say I want to add the number of words in the text in addition to the text itself (because I think it may affect the result). How should I go about doing so?
Do I have to add another classifier on top of that one? Or is there a way to add that input to the vectorized text?
Scikit-learn classifiers work with NumPy arrays. This means that after vectorizing the text, you can add your new features to this array (not very easily, I should say, but it is doable). The problem is that in text categorization your features will be sparse: the vectorizer returns a SciPy sparse matrix, so ordinary NumPy column additions do not work on it.
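As a rough illustration of the sparsity point, here is a minimal sketch on a toy corpus. The documents, the extra values, and the variable names are all made up for illustration. It compares the dense route, which only works when the whole matrix fits in memory as a dense array, with `scipy.sparse.hstack`, which keeps the result sparse.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, invented for this sketch
docs = ["the cat sat on the mat", "dogs chase cats", "the mat is red"]

X_text = TfidfVectorizer().fit_transform(docs)  # SciPy sparse matrix
extra = np.array([[6.0], [3.0], [4.0]])         # one extra value per document

# Dense route: fine for toy data, but it materializes every zero entry
X_dense = np.hstack([X_text.toarray(), extra])

# Sparse route: make the new column sparse, then stack sparse with sparse
X_sparse = sp.hstack([X_text, sp.csr_matrix(extra)], format="csr")

print(sp.issparse(X_text), sp.issparse(X_sparse))
print(X_dense.shape, X_sparse.shape)
```

Both routes hold the same values. The sparse one simply avoids allocating a dense array with one cell per (document, vocabulary term) pair, which matters once the vocabulary reaches tens of thousands of terms.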
The code below is modified from the text mining example in the scikit-learn SciPy 2013 tutorial.
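from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
import numpy as np
import scipy.sparse

# Assumption: `categories` is not defined in the original snippet. These four
# newsgroups are an assumed choice, consistent with the 2034 documents
# reported in the output.
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

# Load the text data
twenty_train_subset = load_files('datasets/20news-bydate-train/',
                                 categories=categories, encoding='latin-1')

# Turn the text documents into vectors of word frequencies
vectorizer = TfidfVectorizer(min_df=2)
X_train_only_text_features = vectorizer.fit_transform(twenty_train_subset.data)
print(type(X_train_only_text_features))
print("X_train_only_text_features", X_train_only_text_features.shape)

size = X_train_only_text_features.shape[0]
print("size", size)

# Build the extra feature as an (n_documents, 1) column. A column of ones
# stands in here for a real feature such as the word count.
ones_column = np.ones(size).reshape(size, 1)
print("ones_column", ones_column.shape)

# Convert the dense column to sparse so it can be stacked with the text features
new_column = scipy.sparse.csr_matrix(ones_column)
print(type(new_column))
print("new_column", new_column.shape)

# Stack the new column and the text features side by side
X_train = scipy.sparse.hstack([new_column, X_train_only_text_features])
print("X_train", X_train.shape)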
The output is as follows (from a Python 2 run, hence the `L` suffixes):
<class 'scipy.sparse.csr.csr_matrix'>
X_train_only_text_features (2034, 17566)
size 2034
ones_column (2034L, 1L)
<class 'scipy.sparse.csr.csr_matrix'>
new_column (2034, 1)
X_train (2034, 17567)
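To connect this back to the original question, the column of ones can be replaced by an actual per-document word count. The following is a minimal sketch under stated assumptions, not a definitive recipe: the toy documents, the labels, and the helper names `word_count_column` and `build_features` are hypothetical, and words are counted by simple whitespace splitting.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data, invented for this sketch
train_docs = [
    "cheap pills buy now",
    "limited offer buy cheap watches now",
    "meeting moved to friday afternoon",
    "please review the attached quarterly report before the meeting",
]
train_labels = [1, 1, 0, 0]  # hypothetical labels: 1 = spam, 0 = not spam


def word_count_column(docs):
    """Return an (n_docs, 1) sparse column with each document's word count."""
    counts = np.array([[len(doc.split())] for doc in docs], dtype=float)
    return sp.csr_matrix(counts)


def build_features(vectorizer, docs, fit=False):
    """Stack the TF-IDF features with the word-count column."""
    if fit:
        X_text = vectorizer.fit_transform(docs)
    else:
        X_text = vectorizer.transform(docs)
    return sp.hstack([X_text, word_count_column(docs)], format="csr")


vectorizer = TfidfVectorizer()
X_train = build_features(vectorizer, train_docs, fit=True)

clf = MultinomialNB()
clf.fit(X_train, train_labels)

# New documents go through the same steps, with transform (not fit_transform)
test_docs = ["buy cheap pills", "report for the friday meeting"]
X_test = build_features(vectorizer, test_docs)
predictions = clf.predict(X_test)
print(X_train.shape, X_test.shape, predictions)
```

The extra column must be appended in the same position at training and prediction time so the feature indices line up. The raw count is also on a different scale from the TF-IDF weights, so it may be worth rescaling it before stacking. Whether that helps depends on the classifier and the data.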