How to write a fit_transformer with two inputs and include it in a pipeline in python sklearn?
How to fit different inputs into an sklearn Pipeline?
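I am using a Pipeline from sklearn to classify text.
In this example Pipeline I have a TfidfVectorizer and some custom features wrapped in a FeatureUnion, followed by a classifier, as the Pipeline steps. I then fit the training data and do the prediction: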
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']
# load custom features and FeatureUnion with Vectorizer
features = []
measure_features = MeasureFeatures() # this class includes my custom features
features.append(('measure_features', measure_features))
countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features=4000)
features.append(('ngram', countVecWord))
all_features = FeatureUnion(features)
# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C=0.1)
pipeline = Pipeline([
    ('all', all_features),
    ('clf', LinearSVC1),
])
pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)
# etc.
The code above works fine, but there is a twist: I want to do part-of-speech (POS) tagging on the text and use a different vectorizer on the tagged text.
X = ['I am a sentence', 'an example']
X_tagged = do_tagging(X)
# X_tagged = ['PP AUX DET NN', 'DET NN']
Y = [1, 2]
X_dev = ['another sentence']
X_dev_tagged = do_tagging(X_dev)
# load custom features and FeatureUnion with Vectorizer
features = []
measure_features = MeasureFeatures() # this class includes my custom features
features.append(('measure_features', measure_features))
countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features=4000)
# new POS Vectorizer
countVecPOS = TfidfVectorizer(ngram_range=(1, 4), max_features=2000)
features.append(('ngram', countVecWord))
# register the POS vectorizer here (not countVecWord a second time)
features.append(('pos_ngram', countVecPOS))
all_features = FeatureUnion(features)
# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C=0.1)
pipeline = Pipeline([
    ('all', all_features),
    ('clf', LinearSVC1),
])
# how do I fit both X and X_tagged here
# how can the different vectorizers get either X or X_tagged?
pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)
# etc.
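How do I properly fit this kind of data? How can the two vectorizers tell the raw text from the POS text? What are my options?
I also have custom features; some of them take the raw text and others take the POS text.
EDIT: Added MeasureFeatures()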
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
import numpy as np

# type_token_ratio() and count_nouns() are helper functions not shown here
class MeasureFeatures(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def get_feature_names(self):
        return np.array(['type_token', 'count_nouns'])

    def fit(self, documents, y=None):
        return self

    def transform(self, x_dataset):
        X_type_token = list()
        X_count_nouns = list()
        for sentence in x_dataset:
            # takes raw text and calculates type token ratio
            X_type_token.append(type_token_ratio(sentence))
            # takes pos tag text and counts number of noun pos tags (NN, NNS etc.)
            X_count_nouns.append(count_nouns(sentence))
        X = np.array([X_type_token, X_count_nouns]).T
        print(X)
        print(X.shape)
        # the scaler is fitted on the first call to transform() and reused afterwards
        if not hasattr(self, 'scaler'):
            self.scaler = StandardScaler().fit(X)
        return self.scaler.transform(X)
This feature transformer then needs either the tagged text for the count_nouns() function or the raw text for type_token_ratio().
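One possible way to give a single transformer both views of the data is to pass each sample as a (raw_text, tagged_text) pair. The sketch below is only an illustration under that assumption: `PairedMeasureFeatures` is a hypothetical name, the two helper functions are simple stand-ins for the real `type_token_ratio()` and `count_nouns()` (which are not shown in the question), and the scaling step is left out for brevity.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


def type_token_ratio(sentence):
    # Hypothetical stand-in: distinct tokens divided by total tokens.
    tokens = sentence.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def count_nouns(tagged_sentence):
    # Hypothetical stand-in: count tags that start with "NN" (NN, NNS, ...).
    return sum(tag.startswith("NN") for tag in tagged_sentence.split())


class PairedMeasureFeatures(BaseEstimator, TransformerMixin):
    """Assumes every sample is a (raw_text, tagged_text) pair."""

    def fit(self, pairs, y=None):
        # Nothing to learn; present only to satisfy the sklearn API.
        return self

    def transform(self, pairs):
        # Raw text feeds the type-token ratio, tagged text feeds the noun count.
        rows = [[type_token_ratio(raw), count_nouns(tagged)]
                for raw, tagged in pairs]
        return np.array(rows, dtype=float)


pairs = [("I am a sentence", "PP AUX DET NN"), ("an example", "DET NN")]
measure_matrix = PairedMeasureFeatures().fit_transform(pairs)
```

With this layout, a vectorizer that needs only one view would have to be preceded by a small step that picks the matching element out of each pair.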
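I think you have to do a FeatureUnion over 2 transformers (TfidfTransformer and POSTransformer). Of course, you need to define that POSTransformer yourself.
Maybe this article will help you.
Your pipeline might look something like this: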
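from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# POSTransformer and MeasureFeatures are custom transformers defined elsewhere
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('ngram_tf_idf', Pipeline([
            ('counts_ngram', CountVectorizer()),
            ('tf_idf_ngram', TfidfTransformer())
        ])),
        ('pos_tf_idf', Pipeline([
            ('pos', POSTransformer()),
            ('counts_pos', CountVectorizer()),
            ('tf_idf_pos', TfidfTransformer())
        ])),
        ('measure_features', MeasureFeatures())
    ])),
    ('classifier', LinearSVC())
])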
This assumes that MeasureFeatures and POSTransformer are transformers that conform to the sklearn API.
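As a rough sketch of what such a POSTransformer could look like: it only needs `fit()` and `transform()`, and `transform()` turns raw sentences into tag strings so that the next step can vectorize them. The toy lookup table and this version of `do_tagging()` are assumptions made purely for illustration; a real tagger would take their place. MeasureFeatures is left out because its helper functions are not shown.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# Toy lookup table standing in for a real tagger (illustrative only).
TOY_TAGS = {"i": "PP", "am": "AUX", "a": "DET", "an": "DET",
            "another": "DET", "sentence": "NN", "example": "NN"}


def do_tagging(documents):
    # Hypothetical replacement for the do_tagging() used in the question.
    return [" ".join(TOY_TAGS.get(tok.lower(), "UNK") for tok in doc.split())
            for doc in documents]


class POSTransformer(BaseEstimator, TransformerMixin):
    """Stateless transformer: raw sentences in, POS tag strings out."""

    def fit(self, documents, y=None):
        # Nothing to learn; present only to satisfy the sklearn API.
        return self

    def transform(self, documents):
        return do_tagging(documents)


pipeline = Pipeline([
    ("features", FeatureUnion([
        # Branch 1: word n-grams on the raw text.
        ("ngram_tf_idf", Pipeline([
            ("counts_ngram", CountVectorizer(ngram_range=(1, 3))),
            ("tf_idf_ngram", TfidfTransformer()),
        ])),
        # Branch 2: tag the raw text first, then n-grams over the tags.
        ("pos_tf_idf", Pipeline([
            ("pos", POSTransformer()),
            ("counts_pos", CountVectorizer(ngram_range=(1, 4))),
            ("tf_idf_pos", TfidfTransformer()),
        ])),
    ])),
    ("classifier", LinearSVC()),
])

X = ["I am a sentence", "an example"]
Y = [1, 2]
pipeline.fit(X, Y)
y_pred = pipeline.predict(["another sentence"])
```

Because the tagging happens inside the pipeline, only the raw text is passed to `fit()` and `predict()`, and each branch of the FeatureUnion derives the view it needs. The trade-off is that the text is re-tagged on every call, which may be slow with a real tagger.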