How do I convert new data into the PCA components of my training data?
Suppose I have some text sentences that I want to cluster using kmeans.
sentences = [
"fix grammatical or spelling errors",
"clarify meaning without changing it",
"correct minor mistakes",
"add related resources or links",
"always respect the original author"
]
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(sentences)
num_clusters = 2
km = KMeans(n_clusters=num_clusters, init='random', n_init=1, verbose=1)
km.fit(X)
Now I can predict which cluster a new text would fall into:
new_text = "hello world"
vec = vectorizer.transform([new_text])
print(km.predict(vec)[0])
However, say I apply PCA to reduce 10,000 features to 50.
from sklearn.decomposition import RandomizedPCA
pca = RandomizedPCA(n_components=50,whiten=True)
X2 = pca.fit_transform(X)
km.fit(X2)
I cannot do the same thing anymore to predict the cluster for a new text, because the vectorizer's output no longer matches what the model was fitted on:
new_text = "hello world"
vec = vectorizer.transform([new_text])
print(km.predict(vec)[0])
ValueError: Incorrect number of features. Got 10000 features, expected 50
So how do I transform my new text into the lower-dimensional feature space?
You want to use pca.transform on your new data before feeding it to the model. This will perform dimensionality reduction using the same PCA model that was fitted when you ran pca.fit_transform on your original data. You can then use your fitted model to predict on this reduced data.
Basically, think of it as fitting one large model, which consists of stacking three smaller models. First you have a CountVectorizer model that determines how to process data. Then you run a RandomizedPCA model that performs dimensionality reduction. And finally you run a KMeans model for clustering. When you fit the models, you go down the stack and fit each one. And when you want to do prediction, you also have to go down the stack and apply each one.
# Initialize models
vectorizer = CountVectorizer(min_df=1)
pca = RandomizedPCA(n_components=50, whiten=True)
km = KMeans(n_clusters=2, init='random', n_init=1, verbose=1)
# Fit models
X = vectorizer.fit_transform(sentences)
X2 = pca.fit_transform(X)
km.fit(X2)
# Predict with models
X_new = vectorizer.transform(["hello world"])
X2_new = pca.transform(X_new)
km.predict(X2_new)
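A quick way to convince yourself that pca.transform reuses the projection learned during fitting is to transform the training data a second time and compare it with the result of fit_transform. The sketch below assumes a current scikit-learn, where RandomizedPCA has been removed, so it uses plain PCA on a densified matrix and only 2 components (with 5 short sentences there aren't enough features for 50):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "fix grammatical or spelling errors",
    "clarify meaning without changing it",
    "correct minor mistakes",
    "add related resources or links",
    "always respect the original author",
]

vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(sentences).toarray()  # PCA needs a dense array

pca = PCA(n_components=2, whiten=True)
X2 = pca.fit_transform(X)    # fits the PCA model AND transforms the training data
X2_again = pca.transform(X)  # reuses the already-fitted model

# fit_transform(X) and fit(X) followed by transform(X) give the same projection
assert np.allclose(X2, X2_again)

# A new document is projected into the same 2-dimensional space
vec = vectorizer.transform(["hello world"]).toarray()
print(pca.transform(vec).shape)  # (1, 2)
```

This is exactly why the predict step above works: the new text passes through the same fitted vectorizer and the same fitted PCA before reaching KMeans.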
Use a Pipeline:
>>> from sklearn.cluster import KMeans
>>> from sklearn.decomposition import RandomizedPCA
>>> from sklearn.decomposition import TruncatedSVD
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.pipeline import make_pipeline
>>> sentences = [
... "fix grammatical or spelling errors",
... "clarify meaning without changing it",
... "correct minor mistakes",
... "add related resources or links",
... "always respect the original author"
... ]
>>> vectorizer = CountVectorizer(min_df=1)
>>> svd = TruncatedSVD(n_components=5)
>>> km = KMeans(n_clusters=2, init='random', n_init=1)
>>> pipe = make_pipeline(vectorizer, svd, km)
>>> pipe.fit(sentences)
Pipeline(steps=[('countvectorizer', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,...n_init=1,
n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
verbose=1))])
>>> pipe.predict(["hello, world"])
array([0], dtype=int32)
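The fitted sub-models stay accessible on the pipeline through named_steps, which is handy if you want to inspect, say, the learned vocabulary or the SVD components. A small sketch, assuming the step names that make_pipeline auto-generates (lowercased class names):

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

docs = [
    "fix grammatical or spelling errors",
    "clarify meaning without changing it",
    "correct minor mistakes",
    "add related resources or links",
]

pipe = make_pipeline(CountVectorizer(min_df=1),
                     TruncatedSVD(n_components=2),
                     KMeans(n_clusters=2, n_init=1, random_state=0))
pipe.fit(docs)

# make_pipeline names each step after its lowercased class name
vocab = pipe.named_steps['countvectorizer'].vocabulary_
svd = pipe.named_steps['truncatedsvd']
print(len(vocab))             # size of the learned vocabulary
print(svd.components_.shape)  # (n_components, vocabulary size)
```

Each call to pipe.predict runs the new text through all three fitted steps in order, which is the same "go down the stack" logic as the manual version, just encapsulated.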
(Showing TruncatedSVD because RandomizedPCA will stop working on text frequency matrices in an upcoming release; it actually performed an SVD, not full PCA, anyway.)
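For readers on current scikit-learn versions: RandomizedPCA has since been removed entirely. A sketch of the two replacements, assuming scikit-learn >= 0.18: TruncatedSVD accepts the sparse term-count matrix directly, while PCA(svd_solver='randomized') is the drop-in equivalent of RandomizedPCA but requires dense input.

```python
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "fix grammatical or spelling errors",
    "clarify meaning without changing it",
    "correct minor mistakes",
]
X = CountVectorizer().fit_transform(docs)  # sparse CSR matrix

# TruncatedSVD works on sparse input directly (no densifying needed)
svd = TruncatedSVD(n_components=2)
X_svd = svd.fit_transform(X)
print(X_svd.shape)  # (3, 2)

# PCA(svd_solver='randomized') replaces RandomizedPCA, but needs dense input
pca = PCA(n_components=2, svd_solver='randomized')
X_pca = pca.fit_transform(X.toarray())
print(X_pca.shape)  # (3, 2)
```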