
How do I convert new data into the PCA components of my training data?

Suppose I have some text sentences that I want to cluster using kmeans.

sentences = [
    "fix grammatical or spelling errors",
    "clarify meaning without changing it",
    "correct minor mistakes",
    "add related resources or links",
    "always respect the original author"
]

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(sentences)
num_clusters = 2
km = KMeans(n_clusters=num_clusters, init='random', n_init=1, verbose=1)
km.fit(X)

Now I can predict which cluster a new text falls into:

new_text = "hello world"
vec = vectorizer.transform([new_text])
print(km.predict(vec)[0])

However, say I apply PCA to reduce 10,000 features to 50.

from sklearn.decomposition import RandomizedPCA

pca = RandomizedPCA(n_components=50, whiten=True)
X2 = pca.fit_transform(X)
km.fit(X2)

I can no longer do the same thing to predict the cluster for a new text, because the vectorizer's 10,000-dimensional output no longer matches the 50-dimensional space the model was fitted on:

new_text = "hello world"
vec = vectorizer.transform([new_text])
print(km.predict(vec)[0])
ValueError: Incorrect number of features. Got 10000 features, expected 50

So how do I transform my new text into the lower-dimensional feature space?

You want to use pca.transform on your new data before feeding it to the model. This will perform dimensionality reduction using the same PCA model that was fitted when you ran pca.fit_transform on your original data. You can then use your fitted model to predict on this reduced data.
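Concretely, the failing snippet from the question becomes (reusing the already-fitted vectorizer, pca and km):

new_text = "hello world"
vec = vectorizer.transform([new_text])   # 10,000-dimensional count vector
vec_reduced = pca.transform(vec)         # project onto the 50 fitted components
print(km.predict(vec_reduced)[0])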

Basically, think of it as fitting one large model, which consists of stacking three smaller models. First you have a CountVectorizer model that determines how to process data. Then you run a RandomizedPCA model that performs dimensionality reduction. And finally you run a KMeans model for clustering. When you fit the models, you go down the stack and fit each one. And when you want to do prediction, you also have to go down the stack and apply each one.

# Initialize models
vectorizer = CountVectorizer(min_df=1)
pca = RandomizedPCA(n_components=50, whiten=True)
km = KMeans(n_clusters=2, init='random', n_init=1, verbose=1)

# Fit models
X = vectorizer.fit_transform(sentences)
X2 = pca.fit_transform(X)
km.fit(X2)

# Predict with models
X_new = vectorizer.transform(["hello world"])
X2_new = pca.transform(X_new)
km.predict(X2_new)
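Note that at prediction time you only ever call transform (or predict) on the fitted models, never fit or fit_transform, so the parameters learned during fitting are reused rather than re-estimated on the new data.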

Use a Pipeline:

>>> from sklearn.cluster import KMeans
>>> from sklearn.decomposition import RandomizedPCA
>>> from sklearn.decomposition import TruncatedSVD
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.pipeline import make_pipeline
>>> sentences = [
...     "fix grammatical or spelling errors",
...     "clarify meaning without changing it",
...     "correct minor mistakes",
...     "add related resources or links",
...     "always respect the original author"
... ]
>>> vectorizer = CountVectorizer(min_df=1)
>>> svd = TruncatedSVD(n_components=5)
>>> km = KMeans(n_clusters=2, init='random', n_init=1)
>>> pipe = make_pipeline(vectorizer, svd, km)
>>> pipe.fit(sentences)
Pipeline(steps=[('countvectorizer', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,...n_init=1,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=1))])
>>> pipe.predict(["hello, world"])
array([0], dtype=int32)
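When you call pipe.fit, the pipeline runs fit_transform on each intermediate step in order; when you call pipe.predict, it runs transform on each intermediate step and predict on the final one. The new text therefore passes through exactly the same fitted vectorizer and SVD before reaching KMeans.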

(Showing TruncatedSVD because RandomizedPCA will stop working on text frequency matrices in an upcoming release; it actually performed an SVD, not full PCA, anyway.)
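For completeness, a minimal sketch of the same substitution applied to the stacked approach from the first answer, with TruncatedSVD standing in for RandomizedPCA (assuming the same sentences list as above):

from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1)
svd = TruncatedSVD(n_components=5)   # works directly on sparse count matrices
km = KMeans(n_clusters=2, init='random', n_init=1)

# Fit: each stage is fitted on the output of the previous one
X = vectorizer.fit_transform(sentences)
X2 = svd.fit_transform(X)
km.fit(X2)

# Predict: reuse the fitted stages; transform, never refit
X2_new = svd.transform(vectorizer.transform(["hello world"]))
print(km.predict(X2_new)[0])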
