How to use sklearn Pipeline with custom features?
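I am doing text classification with Python and sklearn. Besides the vectorizer, I also have some custom features. I would like to know whether they can be used with an sklearn Pipeline and how the features would be stacked in it.
Below is a short example of my current classification code, which does not use a Pipeline. Please tell me if you see anything wrong with it; your help would be much appreciated. Is it possible to use it with an sklearn Pipeline in some way? I wrote my own function, get_features(), which extracts the custom features, transforms the text with the vectorizer, scales the features, and finally stacks all of them.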
import sklearn.svm
import re
from sklearn import metrics
import numpy
import scipy.sparse
import datetime
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from nltk.tokenize import word_tokenize, sent_tokenize
from sklearn.preprocessing import StandardScaler
# custom feature example: number of tokens that start with an uppercase letter
def words_capitalized(sentence):
    # tokenize the sentence
    tokens = word_tokenize(sentence)
    counter = 0
    for word in tokens:
        if word[0].isupper():
            counter += 1
    return counter
# custom feature example: length of every token in the sentence
def words_length(sentence):
    # tokenize the sentence
    tokens = word_tokenize(sentence)
    list_of_length = list()
    for word in tokens:
        # len() is the built-in; length() does not exist in Python
        list_of_length.append(len(word))
    return list_of_length
def get_features(untagged_text, value, scaler):
    # this function extracts the custom features,
    # transforms the vectorizer,
    # scales the features
    # and finally stacks all of them
    list_of_length = list()
    list_of_capitals = list()
    if value == 1:
        # training set: learn the vocabulary, then transform
        X_bow = countVecWord.fit_transform(untagged_text)
    else:
        # test set: reuse the vocabulary learned on the training set
        X_bow = countVecWord.transform(untagged_text)
    # I also see some people use X_bow = countVecWord.transform(untagged_text).todense(), what does the .todense() option do here?
    for sentence in untagged_text:
        # words_length() returns one length per token; the mean is used here
        # (an assumption) so that every sentence yields exactly one value
        lengths = words_length(sentence)
        list_of_length.append([numpy.mean(lengths) if lengths else 0.0])
        list_of_capitals.append([words_capitalized(sentence)])
    # turn the feature output into one dense numpy array with two columns
    X_custom = numpy.hstack((numpy.array(list_of_length), numpy.array(list_of_capitals)))
    if value == 1:
        # fit transform for training set
        X_custom = scaler.fit_transform(X_custom)
    else:
        # transform only for test set
        X_custom = scaler.transform(X_custom)
    # stack all features as a sparse matrix
    return scipy.sparse.hstack((X_bow, X_custom)).tocsr()
def fit_and_predict(train_labels, train_features, test_features, classifier):
    # fit the training set
    classifier.fit(train_features, train_labels)
    # return the classification result
    return classifier.predict(test_features)
if __name__ == '__main__':
    # read_data() is my own loader and is not shown here
    input_sets = read_data()
    X = input_sets[0]
    Y = input_sets[1]
    X_dev = input_sets[2]
    Y_dev = input_sets[3]
    # initialize the count vectorizer
    countVecWord = CountVectorizer(ngram_range=(1, 3))
    scaler = StandardScaler()
    # extract features
    # for training
    X_total = get_features(X, 1, scaler)
    # for dev set
    X_total_dev = get_features(X_dev, 2, scaler)
    # store labels as numpy array
    y_train = numpy.asarray(Y)
    y_dev = numpy.asarray(Y_dev)
    # train the classifier
    SVC1 = LinearSVC(C=1.0)
    y_predicted = fit_and_predict(y_train, X_total, X_total_dev, SVC1)
    print("Result for dev set")
    precision, recall, f1, _ = metrics.precision_recall_fscore_support(y_dev, y_predicted)
    print("Precision: ", precision, " Recall: ", recall, " F1-Score: ", f1)
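On the .todense() question in the code comment above: as far as I understand it, CountVectorizer returns a scipy sparse matrix that stores only the non-zero counts, and .todense() (or .toarray()) converts it into a dense array in which every zero is stored explicitly. With ngram_range=(1, 3) the vocabulary can get very large, so the dense copy may cost a lot of memory. The sketch below uses made-up toy sentences and a hand-written capitals column (both are assumptions, not data from this question) to show that stacking works without densifying.

```python
import numpy
import scipy.sparse
from sklearn.feature_extraction.text import CountVectorizer

# toy data, not from the question
texts = ["The cat sat", "A Dog barked at the Cat"]

vectorizer = CountVectorizer()
X_sparse = vectorizer.fit_transform(texts)  # scipy sparse matrix, zeros not stored
X_dense = X_sparse.toarray()                # numpy.ndarray, zeros stored explicitly

# one hypothetical custom feature column (number of capitalized words)
capitals = numpy.array([[1], [3]])

# the stack can stay sparse, so the bag-of-words block is never densified
X_stacked = scipy.sparse.hstack((X_sparse, capitals)).tocsr()
```

LinearSVC accepts sparse input, so the sparse result can be passed to fit() directly.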
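I know that FeatureUnion exists, but I do not know whether it can be used for my purpose and whether it can handle the scaling and stacking of these features.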
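One way this could look, as a minimal sketch rather than a tested solution: each custom feature is wrapped in a FunctionTransformer, given its own StandardScaler, and combined with the vectorizer through a FeatureUnion. The helper names count_capitalized, mean_word_length and scaled, the toy sentences, and the use of str.split() instead of nltk's word_tokenize are all assumptions made for illustration.

```python
import numpy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.svm import LinearSVC


def count_capitalized(texts):
    # hypothetical stand-in for words_capitalized(); one column per text
    return numpy.array(
        [[sum(1 for w in t.split() if w[0].isupper())] for t in texts], dtype=float
    )


def mean_word_length(texts):
    # hypothetical fixed-size variant of words_length(); one column per text
    rows = []
    for t in texts:
        words = t.split()
        rows.append([sum(len(w) for w in words) / len(words) if words else 0.0])
    return numpy.array(rows)


def scaled(func):
    # wrap a plain function as a transformer followed by its own scaler
    return Pipeline([
        ("extract", FunctionTransformer(func, validate=False)),
        ("scale", StandardScaler()),
    ])


features = FeatureUnion([
    ("bow", CountVectorizer(ngram_range=(1, 3))),
    ("capitals", scaled(count_capitalized)),
    ("length", scaled(mean_word_length)),
])
model = Pipeline([("features", features), ("clf", LinearSVC(C=1.0))])

# toy data, not from the question
train_texts = [
    "The Quick Brown Fox jumps",
    "a lazy dog sleeps all day",
    "Paris And London Are Cities",
    "nothing here is capitalized",
]
train_labels = [1, 0, 1, 0]

model.fit(train_texts, train_labels)
predictions = model.predict(["Hello World From Berlin", "just some lowercase words"])
```

Because the scalers live inside the pipeline, fit() learns their statistics on the training texts only and predict() reuses them, which replaces the manual value == 1 switch in get_features().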
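Edit: This seems to be a good start: https://michelleful.github.io/code-blog/2015/06/20/pipelines/
I have not tried it yet; I will post an update when I have. The question now is how I can do feature selection with a pipeline.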
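For the feature selection part, one option (a sketch under assumptions, not something I have verified on real data) is to put a selector such as SelectKBest between the feature extraction step and the classifier. The toy sentences and k=5 below are arbitrary illustrative choices. Note that chi2 only accepts non-negative features, so it fits raw counts but not standardized custom features; for those, a different score function such as f_classif would be needed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# toy data, not from the question
train_texts = [
    "The Quick Brown Fox jumps",
    "a lazy dog sleeps all day",
    "Paris And London Are Cities",
    "nothing here is capitalized",
]
train_labels = [1, 0, 1, 0]

model = Pipeline([
    ("bow", CountVectorizer(ngram_range=(1, 3))),
    ("select", SelectKBest(chi2, k=5)),  # k=5 is an arbitrary illustrative value
    ("clf", LinearSVC(C=1.0)),
])
model.fit(train_texts, train_labels)

# boolean mask over the n-gram vocabulary: True for the features that were kept
selected = model.named_steps["select"].get_support()
n_kept = int(selected.sum())
predictions = model.predict(["A Big Red Bus"])
```

The selector is fitted on the training data only, and the same selected columns are applied automatically at prediction time.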
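For anyone interested: a custom feature class needs to have fit and transform methods before it can be used in a FeatureUnion. For a detailed example, see my other question here > How to fit different inputs into an sklearn Pipeline?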
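A minimal sketch of such a class, assuming the capitalized-word count from the question as the feature: the class name CapitalizedWordCounter, the toy sentences, and the use of str.split() instead of nltk's word_tokenize are illustrative assumptions. Inheriting from BaseEstimator and TransformerMixin supplies get_params() and fit_transform(), so only fit and transform have to be written.

```python
import numpy
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion


class CapitalizedWordCounter(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one column holding the number of capitalized words."""

    def fit(self, X, y=None):
        # nothing to learn here, but fit must exist and return self
        return self

    def transform(self, X):
        counts = [sum(1 for w in text.split() if w[0].isupper()) for text in X]
        # FeatureUnion expects a 2-D array: one row per text, one column per feature
        return numpy.array(counts, dtype=float).reshape(-1, 1)


union = FeatureUnion([
    ("bow", CountVectorizer()),
    ("capitals", CapitalizedWordCounter()),
])

# toy data, not from the question
texts = ["The cat sat", "A Dog barked at the Cat"]
X_all = union.fit_transform(texts)  # sparse: bag-of-words columns plus one count column
```

The last column of X_all holds the custom feature, and the union can be used as the first step of a Pipeline in place of a single vectorizer.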