

Sklearn other inputs in addition to text for text classification

I am trying to build a text classifier using scikit-learn's bag-of-words vectorization fed into a classifier. However, I was wondering how I would add another variable to the input apart from the text itself. Say I want to add the number of words in the text as a feature in addition to the text itself (because I think it may affect the result). How should I go about doing so?
Do I have to add another classifier on top of that one? Or is there a way to add that input to the vectorized text?

Scikit-learn classifiers work with numpy arrays. This means that after vectorizing the text, you can add your new features to this array (I take "easily" back, not very easily but doable). The problem is that in text categorization your features are sparse, so normal numpy column additions do not work; you have to stack the new column onto the sparse matrix with scipy.sparse.hstack instead.

The code below is modified from the text mining example in the scikit-learn scipy 2013 tutorial.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
import numpy as np
import scipy.sparse

# `categories` was undefined in the original snippet; the four-category
# subset used in the scipy 2013 tutorial is assumed here.
categories = ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

# Load the text data
twenty_train_subset = load_files('datasets/20news-bydate-train/',
    categories=categories, encoding='latin-1')

# Turn the text documents into tf-idf weighted vectors of word frequencies
vectorizer = TfidfVectorizer(min_df=2)
X_train_only_text_features = vectorizer.fit_transform(twenty_train_subset.data)

print(type(X_train_only_text_features))
print("X_train_only_text_features", X_train_only_text_features.shape)

size = X_train_only_text_features.shape[0]
print("size", size)

# Dense column of ones standing in for the extra feature
ones_column = np.ones(size).reshape(size, 1)
print("ones_column", ones_column.shape)

# Convert the dense column to a sparse matrix so it can be stacked
new_column = scipy.sparse.csr_matrix(ones_column)
print(type(new_column))
print("new_column", new_column.shape)

# hstack glues the extra column onto the sparse tf-idf matrix
X_train = scipy.sparse.hstack([new_column, X_train_only_text_features])

print("X_train", X_train.shape)

The output is the following:

<class 'scipy.sparse.csr.csr_matrix'>
X_train_only_text_features (2034, 17566)
size 2034
ones_column (2034L, 1L)
<class 'scipy.sparse.csr.csr_matrix'>
new_column (2034, 1)
X_train (2034, 17567)
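
As a follow-up, here is a minimal sketch (not part of the original answer) of how the dummy ones column could be replaced with the feature the question actually asks about, a per-document word count, and how the combined matrix could then be fed to the MultinomialNB classifier imported above. The whitespace-based word count and the variable names (word_counts, X_train_combined, clf) are assumptions for illustration; any per-document numeric feature could be stacked the same way.

# Hypothetical follow-up: replace the dummy ones column with a real
# per-document feature (here, a simple whitespace-token word count).
word_counts = np.array([len(doc.split()) for doc in twenty_train_subset.data],
                       dtype=float).reshape(-1, 1)
word_count_column = scipy.sparse.csr_matrix(word_counts)

# Stack the extra column next to the sparse tf-idf features.
X_train_combined = scipy.sparse.hstack([word_count_column,
                                        X_train_only_text_features]).tocsr()

# The combined matrix can be used like any other feature matrix.
clf = MultinomialNB()
clf.fit(X_train_combined, twenty_train_subset.target)

Note that MultinomialNB expects non-negative features, which both tf-idf values and word counts satisfy; for classifiers sensitive to feature scale, the extra column would typically be scaled before stacking.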
