
sklearn pipeline ValueError: all the input array dimensions except for the concatenation axis must match exactly

I have a sklearn pipeline, which extracts three different features.

manual_feats = Pipeline([
        ('FeatureUnion', FeatureUnion([
            ('segmenting_pip1', Pipeline([
                ('A_features', A_features()),
                ('segmentation', segmentation())
            ])),
            ('segmenting_pip2', Pipeline([
                ('B_features', B_features()),
                ('segmentation', segmentation())
            ])),
            ('segmenting_pip3', Pipeline([
                ('Z_features', Z_features()),
                ('segmentation', segmentation())
            ])),

        ])),
    ])

Features A and B each return an array of shape (# of records, 10, 20), while Z returns an array of shape (# of records, 10, 15).

When I fit the pipeline with all the features I get this error:

 File "C:\Python35\lib\site-packages\sklearn\pipeline.py", line 451, in _transform
    Xt = transform.transform(Xt)
  File "C:\Python35\lib\site-packages\sklearn\pipeline.py", line 829, in transform
    Xs = np.hstack(Xs)
  File "C:\Python35\lib\site-packages\numpy\core\shape_base.py", line 340, in hstack
    return _nx.concatenate(arrs, 1)
ValueError: all the input array dimensions except for the concatenation axis must match exactly

If I exclude feature Z the pipeline works, but the concatenation is applied on axis=1, which gives an array of shape (# of records, 20, 20). What I want is an array of shape (# of records, 10, 40), where the concatenation is applied on axis=2.
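To make the shape problem concrete, here is a minimal NumPy sketch that uses dummy arrays in place of the real features. The record count of 4 and the array contents are assumptions chosen only for illustration; the shapes match the ones described above.

```python
import numpy as np

n_records = 4  # hypothetical number of records

# Dummy stand-ins for the outputs of the A, B and Z branches.
A = np.zeros((n_records, 10, 20))
B = np.ones((n_records, 10, 20))
Z = np.full((n_records, 10, 15), 2.0)

# For arrays with two or more dimensions, np.hstack joins along axis=1.
stacked_ab = np.hstack([A, B])
print(stacked_ab.shape)  # (4, 20, 20)

# The shape asked for in the question needs a join along axis=2 instead.
wanted_ab = np.concatenate([A, B], axis=2)
print(wanted_ab.shape)  # (4, 10, 40)

# Adding Z breaks np.hstack because axis=2 differs (20 vs. 15).
error_message = None
try:
    np.hstack([A, B, Z])
except ValueError as exc:
    error_message = str(exc)
    print("ValueError:", error_message)
```

A join along axis=1 requires every other axis to match, which is why the 15-wide Z array triggers the error while A and B alone do not.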

How can I get what I want using Pipeline and without editing the source code of the library?

Edit: I originally wrote that the concatenation of A and B produces an array of shape (# of records, 10, 40). This is not correct; it produces an array of shape (# of records, 20, 20). I have edited the question accordingly.

I solved the problem by creating a transformer that handles the concatenation process.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class append_split_3D(BaseEstimator, TransformerMixin):
    def __init__(self, segments_number=20, max_len=50, mode='append'):
        self.segments_number = segments_number
        self.max_len = max_len
        self.mode = mode
        # Sentinel used for padding; it must never occur in the real features.
        self.appending_value = -5.123

    def fit(self, X, y=None):
        return self

    def transform(self, data):
        if self.mode == 'append':
            # Pad axis=2 up to max_len. Use a local variable so that repeated
            # calls to transform() do not change self.max_len.
            pad_len = self.max_len - data.shape[2]
            appending = np.full((data.shape[0], data.shape[1], pad_len), self.appending_value)
            return np.concatenate([data, appending], axis=2)
        elif self.mode == 'split':
            # Cut the stacked array along axis=1 into blocks of segments_number rows.
            tmp = []
            for item in range(0, data.shape[1], self.segments_number):
                tmp.append(data[:, item:(item + self.segments_number), :])
            # Drop the padding from each block, then join the blocks on axis=2.
            tmp = [item[item != self.appending_value].reshape(data.shape[0], self.segments_number, -1) for item in tmp]
            return np.concatenate(tmp, axis=2)
        else:
            raise ValueError('Mode value is not defined: %r' % (self.mode,))

The full pipeline then becomes:

manual_feats = Pipeline([
        ('FeatureUnion', FeatureUnion([
            ('segmenting_pip1', Pipeline([
                ('A_features', A_features()),
                ('segmentation', segmentation()),
                ('append', append_split_3D(max_len=50, mode='append')),
            ])),
            ('segmenting_pip2', Pipeline([
                ('B_features', B_features()),
                ('segmentation', segmentation()),
                ('append', append_split_3D(max_len=50, mode='append')),
            ])),
            ('segmenting_pip3', Pipeline([
                ('Z_features', Z_features()),
                ('segmentation', segmentation()),
                ('append', append_split_3D(max_len=50, mode='append')),
            ])),

        ])),
        ('split', append_split_3D(segments_number=10, mode='split')),
    ])

Here is what the transformer does. As an example, features A, B, and Z return arrays of the following shapes:

  • A : (# of records, 10, 20)
  • B : (# of records, 10, 20)
  • Z : (# of records, 10, 15)

In mode='append', I pad every array along axis=2 with a fixed value up to a maximum length of 50 (as an example), so that all arrays have the same axis=2 size and the call Xs = np.hstack(Xs) inside FeatureUnion can work.

As a result, the FeatureUnion step returns an array of shape (# of records, 30, 50).
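The padding step can be checked in isolation with plain NumPy. The sketch below is not the transformer itself: `pad_last_axis` is a hypothetical helper that mirrors the 'append' logic, and the record count of 4 and the random inputs are assumptions.

```python
import numpy as np

PAD_VALUE = -5.123  # sentinel value taken from the answer
MAX_LEN = 50        # target size of axis=2, as in the answer's example


def pad_last_axis(data, max_len=MAX_LEN, pad_value=PAD_VALUE):
    """Pad a 3D array along axis=2 with pad_value up to max_len (hypothetical helper)."""
    pad_len = max_len - data.shape[2]
    filler = np.full((data.shape[0], data.shape[1], pad_len), pad_value)
    return np.concatenate([data, filler], axis=2)


n_records = 4  # hypothetical number of records
A = np.random.rand(n_records, 10, 20)
B = np.random.rand(n_records, 10, 20)
Z = np.random.rand(n_records, 10, 15)

# After padding, all three arrays share axis=2, so np.hstack succeeds.
padded = [pad_last_axis(x) for x in (A, B, Z)]
stacked = np.hstack(padded)
print(stacked.shape)  # (4, 30, 50)
```

Rows 20 to 29 of `stacked` hold Z: its 15 real columns come first, followed by 35 columns of the sentinel.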

Then mode='split', which I add at the end of the pipeline, splits the stacked array of shape (# of records, 30, 50) back into 3 feature arrays of shape (# of records, 10, 50) each.

Next I delete the extra fixed values and concatenate the arrays along the last dimension.

The final array has shape (# of records, 10, 55), where 55 is the sum of the third dimensions of the three arrays (20+20+15), which is what I want.
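As a sanity check, the split step can be reproduced with plain NumPy and compared against a direct concatenation on axis=2. This is only a sketch under assumptions: the record count of 4 and the random inputs are made up, and `np.pad` is used here as a shortcut to rebuild the padded, stacked array.

```python
import numpy as np

PAD_VALUE = -5.123  # sentinel value taken from the answer
n_records, rows = 4, 10  # hypothetical record count; 10 rows per feature

rng = np.random.default_rng(0)
# Stand-ins for features A, B and Z, with values in [0, 1).
features = [rng.random((n_records, rows, width)) for width in (20, 20, 15)]

# Rebuild what the FeatureUnion step returns: each array padded to 50, joined on axis=1.
stacked = np.hstack([
    np.pad(f, ((0, 0), (0, 0), (0, 50 - f.shape[2])), constant_values=PAD_VALUE)
    for f in features
])

# Split: cut axis=1 back into blocks of `rows`, one block per feature.
blocks = [stacked[:, start:start + rows, :] for start in range(0, stacked.shape[1], rows)]

# Strip the sentinel from each block, then join the blocks on axis=2.
stripped = [b[b != PAD_VALUE].reshape(n_records, rows, -1) for b in blocks]
result = np.concatenate(stripped, axis=2)
print(result.shape)  # (4, 10, 55)
```

Note that stripping by value only works when the sentinel never occurs in the real feature values; otherwise genuine entries are removed too and the reshape no longer lines up.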
