
Do you need to scale Vectorizers in sklearn?

I have a set of custom features and a set of features created with Vectorizers, in this case TfidfVectorizer.

All of my custom features are simple np.arrays (e.g. [0, 5, 4, 22, 1]). I am using StandardScaler to scale all of my features, as you can see in my Pipeline by calling StandardScaler after my "custom pipeline". The question is whether there is a way, or a need, to scale the Vectorizers I use in my "vectorized_pipeline". Applying StandardScaler on the vectorizers doesn't seem to work (I get the following error: "ValueError: Cannot center sparse matrices").

And another question: is it sensible to scale all of my features after they have been joined in the FeatureUnion, or should I scale each of them separately (in my example, by calling the scaler inside "pos_cluster" and "stylistic_features" separately instead of after the two have been joined)? Which is the better practice?

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn import feature_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Note: ItemSelector, pos_cluster and stylistic_features are custom
# transformers defined elsewhere in my code; X is really a dict-like
# object exposing the keys 'stem_text', 'pos_text' and 'raw_text',
# simplified here for brevity.
X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']

inner_scaler = StandardScaler()
# classifier (referenced as `classifier` in the pipeline below)
classifier = LinearSVC(tol=1e-4, C=0.1)

# vectorizers
countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features=2000, analyzer=u'word',
                               sublinear_tf=True, use_idf=True, min_df=2, max_df=0.85, lowercase=True)
countVecWord_tags = TfidfVectorizer(ngram_range=(1, 4), max_features=1000, analyzer=u'word',
                                    min_df=2, max_df=0.85, sublinear_tf=True, use_idf=True, lowercase=False)


pipeline = Pipeline([
    ('union', FeatureUnion(
            transformer_list=[

            ('vectorized_pipeline', Pipeline([
                ('union_vectorizer', FeatureUnion([

                    ('stem_text', Pipeline([
                        ('selector', ItemSelector(key='stem_text')),
                        ('stem_tfidf', countVecWord)
                    ])),

                    ('pos_text', Pipeline([
                        ('selector', ItemSelector(key='pos_text')),
                        ('pos_tfidf', countVecWord_tags)
                    ])),

                ])),
            ])),


            ('custom_pipeline', Pipeline([
                ('custom_features', FeatureUnion([

                    ('pos_cluster', Pipeline([
                        ('selector', ItemSelector(key='pos_text')),
                        ('pos_cluster_inner', pos_cluster)
                    ])),

                    ('stylistic_features', Pipeline([
                        ('selector', ItemSelector(key='raw_text')),
                        ('stylistic_features_inner', stylistic_features)
                    ]))

                ])),
                    ('inner_scale', inner_scaler)
            ])),

            ],

            # weight components in FeatureUnion
            # n_jobs=6,

            transformer_weights={
                'vectorized_pipeline': 0.8,  # 0.8,
                'custom_pipeline': 1.0  # 1.0
            },
    )),

    ('clf', classifier),
    ])

pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)

First things first:

Error "Cannot center sparse matrices"

The reason is quite simple - StandardScaler applies the feature-wise transformation:

f_i = (f_i - mean(f_i)) / std(f_i)

which for a sparse matrix will produce a dense one, since mean(f_i) will usually be non-zero; in practice, only entries equal to their feature's mean end up being zero. scikit-learn does not want to do this, as it is a huge modification of your data which might cause failures in other parts of the code, huge memory usage, etc. How to deal with it? If you really want to do this, there are two options:

  • densify your matrix through .toarray(), which will require lots of memory but will give you exactly what you expect
  • create the StandardScaler without the mean, i.e. StandardScaler(with_mean=False), which will instead apply f_i = f_i / std(f_i) but will preserve the sparse format of your data
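A minimal sketch of both options, using a throwaway TfidfVectorizer rather than your full pipeline:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

docs = ['I am a sentence', 'an example', 'another sentence']
tfidf = TfidfVectorizer().fit_transform(docs)  # scipy.sparse CSR matrix

# Option 1: densify first, then center and scale (memory-hungry).
dense_scaled = StandardScaler().fit_transform(tfidf.toarray())

# Option 2: skip centering and keep the sparse format.
sparse_scaled = StandardScaler(with_mean=False).fit_transform(tfidf)

# Passing the sparse matrix to a default StandardScaler is exactly what
# raises "ValueError: Cannot center sparse matrices".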

Is scaling needed?

This is a whole other problem - usually, scaling (of any form) is just a heuristic. It is not something that you have to apply, and there is no guarantee that it will help; it is simply a reasonable thing to do when you have no idea what your data looks like. "Smart" vectorizers such as tfidf are actually already doing it: the idf transformation is supposed to provide a reasonable kind of data scaling. There is no guarantee which approach will work better, but in general tfidf should be enough, especially since it still supports sparse computations while StandardScaler does not.
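As a quick check (relying on TfidfVectorizer's default norm='l2'), you can verify that every tf-idf row already comes out with unit L2 norm, i.e. the vectorizer has already applied a row-wise scaling of its own:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['I am a sentence', 'an example', 'another sentence']
X = TfidfVectorizer().fit_transform(docs)  # default norm='l2'

# Row-wise L2 norms of the sparse matrix - each one is 1.0.
row_norms = np.sqrt(X.multiply(X).sum(axis=1))
print(row_norms)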
