
Custom transformer mixin with FeatureUnion in scikit-learn

I am writing custom transformers in scikit-learn to perform specific operations on an array. For that I inherit from the TransformerMixin class. It works fine when I use a single transformer. However, when I try to combine several of them using FeatureUnion (or make_union), the columns of the original array are replicated once per transformer. What can I do to avoid that? Am I using scikit-learn the way it is meant to be used?

import numpy as np
from sklearn.base import TransformerMixin
from sklearn.pipeline import FeatureUnion

# creation of array
s1 = np.array(['foo', 'bar', 'baz'])
s2 = np.array(['a', 'b', 'c'])
X = np.column_stack([s1, s2])
print('base array: \n', X, '\n')

# A fake example that appends a column (Could be a score, ...) calculated on specific columns from X
class DummyTransformer(TransformerMixin):
    def __init__(self, value=None):
        TransformerMixin.__init__(self)
        self.value = value

    def fit(self, *_):
        return self

    def transform(self, X):
        # appends a column (in this case, a constant) to X
        s = np.full(X.shape[0], self.value)
        X = np.column_stack([X, s])
        return X

# as such, the transformer gives what I need first
transfo = DummyTransformer(value=1)
print('single transformer: \n', transfo.fit_transform(X), '\n')

# but when I try to chain them and create a pipeline I run into the replication of existing columns
stages = []
for i in range(2):
    transfo = DummyTransformer(value=i+1)
    stages.append(('step'+str(i+1),transfo))
pipeunion = FeatureUnion(stages)
print('Given result of the Feature union pipeline: \n', pipeunion.fit_transform(X), '\n')
# columns 1&2 from X are replicated

# I would expect:
expected = np.column_stack([X, np.full(X.shape[0], 1), np.full(X.shape[0], 2) ])
print('Expected result of the Feature Union pipeline: \n', expected, '\n')

Output:

base array: 
 [['foo' 'a']
 ['bar' 'b']
 ['baz' 'c']] 

single transformer: 
 [['foo' 'a' '1']
 ['bar' 'b' '1']
 ['baz' 'c' '1']] 

Given result of the Feature union pipeline: 
 [['foo' 'a' '1' 'foo' 'a' '2']
 ['bar' 'b' '1' 'bar' 'b' '2']
 ['baz' 'c' '1' 'baz' 'c' '2']] 

Expected result of the Feature Union pipeline: 
 [['foo' 'a' '1' '2']
 ['bar' 'b' '1' '2']
 ['baz' 'c' '1' '2']] 

Many thanks

FeatureUnion simply concatenates whatever it gets from its internal transformers, side by side. Each of your transformers returns the original columns together with its new column, so the original columns show up once per transformer. It is up to each transformer to pass only the right data forward.
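To see the concatenation in isolation, here is a minimal sketch. It uses numeric data and FunctionTransformer instead of the classes from the question, and append_constant is a hypothetical helper introduced only for this illustration. It compares the FeatureUnion output with a plain horizontal stack of each transformer's output.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer


def append_constant(A, value):
    """Hypothetical helper: return A with one extra constant column (like the question's transform)."""
    return np.column_stack([A, np.full(A.shape[0], value)])


X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

union = FeatureUnion([
    ("step1", FunctionTransformer(append_constant, kw_args={"value": 1.0})),
    ("step2", FunctionTransformer(append_constant, kw_args={"value": 2.0})),
])

result = union.fit_transform(X)

# What the union does for dense outputs: stack each transformer's result side by side
manual = np.hstack([append_constant(X, 1.0), append_constant(X, 2.0)])

print(result.shape)                    # (3, 6): the two original columns appear twice
print(np.array_equal(result, manual))  # True
```

Because every transformer hands back the two original columns plus its own, the union ends up with 2 x (2 + 1) = 6 columns rather than the 4 you were hoping for.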

I would advise you to return only the new data from each internal transformer, and then concatenate the remaining (original) columns either outside or inside the FeatureUnion.

Look at this example if you haven't already:

For example, you can do this:

# This transformer does nothing; it just passes the data through unchanged
class DataPasser(TransformerMixin):

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

# Your transformer
class DummyTransformer(TransformerMixin):
    def __init__(self, value=None):
        TransformerMixin.__init__(self)
        self.value = value

    def fit(self, *_):
        return self

    # Changed this to only return new column after some operation on X
    def transform(self, X):
        s = np.full(X.shape[0], self.value)
        return s.reshape(-1,1)
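A note on the reshape(-1, 1) in transform: as far as I know, FeatureUnion stacks dense outputs horizontally with NumPy, and that only works when every piece is two-dimensional with the same number of rows. The following sketch (plain NumPy, with made-up numeric data) shows the difference between a flat array and a column.

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

flat = np.full(X.shape[0], 1.0)  # shape (3,), one-dimensional
column = flat.reshape(-1, 1)     # shape (3, 1), a proper single column

# Stacking a 2-D array with a 2-D column works
stacked = np.hstack([X, column])

# Stacking a 2-D array with a 1-D array raises ValueError
try:
    np.hstack([X, flat])
    mixed_dims_ok = True
except ValueError:
    mixed_dims_ok = False

print(stacked.shape)   # (3, 3)
print(mixed_dims_ok)   # False
```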

After that, further down in your code, build the stages like this:

stages = []

# Append our DataPasser first, so the original data comes at the beginning
stages.append(('no_change', DataPasser()))

for i in range(2):
    transfo = DummyTransformer(value=i+1)
    stages.append(('step'+str(i+1),transfo))

pipeunion = FeatureUnion(stages)

Running this new code gives the following result:

('Given result of the Feature union pipeline: \n', 
array([['foo', 'a', '1', '2'],
       ['bar', 'b', '1', '2'],
       ['baz', 'c', '1', '2']], dtype='|S21'), '\n')
('Expected result of the Feature Union pipeline: \n', 
array([['foo', 'a', '1', '2'],
       ['bar', 'b', '1', '2'],
       ['baz', 'c', '1', '2']], dtype='|S21'), '\n')
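The code above concatenates the original columns inside the FeatureUnion. The other option I mentioned, doing it outside, could look roughly like the sketch below. ConstantColumn is a hypothetical stand-in for your DummyTransformer, and I use numeric data and also inherit from BaseEstimator here, which is an assumption on my part about what recent scikit-learn versions prefer, not something your code needs in order to show the idea.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion


class ConstantColumn(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for DummyTransformer: returns only the new column."""

    def __init__(self, value=None):
        self.value = value

    def fit(self, X, y=None):
        # Nothing to learn for a constant column
        return self

    def transform(self, X):
        # One column, shaped (n_samples, 1), without the original data
        return np.full((X.shape[0], 1), self.value)


X = np.array([[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]])

union = FeatureUnion([
    ("step1", ConstantColumn(value=1.0)),
    ("step2", ConstantColumn(value=2.0)),
])

# The union now yields only the new columns
new_columns = union.fit_transform(X)

# Attach the original columns yourself, outside the union
combined = np.hstack([X, new_columns])
print(combined.shape)  # (3, 4)
```

Which variant to pick is mostly a matter of taste: keeping everything inside the FeatureUnion lets you use it as a single step in a larger Pipeline, while stacking outside keeps the union itself free of pass-through steps.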
