How do I convert new data into the PCA components of my training data?
Suppose I have some text sentences that I want to cluster using kmeans.
sentences = [
"fix grammatical or spelling errors",
"clarify meaning without changing it",
"correct minor mistakes",
"add related resources or links",
"always respect the original author"
]
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(sentences)
num_clusters = 2
km = KMeans(n_clusters=num_clusters, init='random', n_init=1, verbose=1)
km.fit(X)
Now I can predict which cluster a new text would fall into:
new_text = "hello world"
vec = vectorizer.transform([new_text])
print(km.predict(vec)[0])
However, say I apply PCA to reduce 10,000 features to 50.
from sklearn.decomposition import RandomizedPCA
pca = RandomizedPCA(n_components=50,whiten=True)
X2 = pca.fit_transform(X)
km.fit(X2)
I cannot do the same thing anymore to predict the cluster for a new text, because the vectorizer's output no longer matches what the model was fitted on:
new_text = "hello world"
vec = vectorizer.transform([new_text])
print(km.predict(vec)[0])
ValueError: Incorrect number of features. Got 10000 features, expected 50
So how do I transform my new text into the lower-dimensional feature space?
You want to use pca.transform on your new data before feeding it to the model. This will perform dimensionality reduction using the same PCA model that was fitted when you ran pca.fit_transform on your original data. You can then use your fitted model to predict on this reduced data.
Basically, think of it as fitting one large model, which consists of stacking three smaller models. First you have a CountVectorizer model that determines how to process data. Then you run a RandomizedPCA model that performs dimensionality reduction. And finally you run a KMeans model for clustering. When you fit the models, you go down the stack and fit each one. And when you want to do prediction, you also have to go down the stack and apply each one.
# Initialize models
vectorizer = CountVectorizer(min_df=1)
pca = RandomizedPCA(n_components=50, whiten=True)
km = KMeans(n_clusters=2, init='random', n_init=1, verbose=1)
# Fit models
X = vectorizer.fit_transform(sentences)
X2 = pca.fit_transform(X)
km.fit(X2)
# Predict with models
X_new = vectorizer.transform(["hello world"])
X2_new = pca.transform(X_new)
km.predict(X2_new)
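A quick way to convince yourself that pca.transform reuses the projection learned during fitting is to transform the training data a second time and compare it with the result of fit_transform. The sketch below assumes a current scikit-learn, where RandomizedPCA has been removed, so it uses plain PCA on a densified matrix and only 2 components (with 5 short sentences there aren't enough features for 50):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "fix grammatical or spelling errors",
    "clarify meaning without changing it",
    "correct minor mistakes",
    "add related resources or links",
    "always respect the original author",
]

vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(sentences).toarray()  # PCA needs a dense array

pca = PCA(n_components=2, whiten=True)
X2 = pca.fit_transform(X)    # fits the PCA model AND transforms the training data
X2_again = pca.transform(X)  # reuses the already-fitted model

# fit_transform(X) and fit(X) followed by transform(X) give the same projection
assert np.allclose(X2, X2_again)

# A new document is projected into the same 2-dimensional space
vec = vectorizer.transform(["hello world"]).toarray()
print(pca.transform(vec).shape)  # (1, 2)
```

This is exactly why the predict step above works: the new text passes through the same fitted vectorizer and the same fitted PCA before reaching KMeans.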
Use a Pipeline:
>>> from sklearn.cluster import KMeans
>>> from sklearn.decomposition import RandomizedPCA
>>> from sklearn.decomposition import TruncatedSVD
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.pipeline import make_pipeline
>>> sentences = [
... "fix grammatical or spelling errors",
... "clarify meaning without changing it",
... "correct minor mistakes",
... "add related resources or links",
... "always respect the original author"
... ]
>>> vectorizer = CountVectorizer(min_df=1)
>>> svd = TruncatedSVD(n_components=5)
>>> km = KMeans(n_clusters=2, init='random', n_init=1)
>>> pipe = make_pipeline(vectorizer, svd, km)
>>> pipe.fit(sentences)
Pipeline(steps=[('countvectorizer', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,...n_init=1,
n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
verbose=1))])
>>> pipe.predict(["hello, world"])
array([0], dtype=int32)
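The fitted sub-models stay accessible on the pipeline through named_steps, which is handy if you want to inspect, say, the learned vocabulary or the SVD components. A small sketch, assuming the step names that make_pipeline auto-generates (lowercased class names):

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

docs = [
    "fix grammatical or spelling errors",
    "clarify meaning without changing it",
    "correct minor mistakes",
    "add related resources or links",
]

pipe = make_pipeline(CountVectorizer(min_df=1),
                     TruncatedSVD(n_components=2),
                     KMeans(n_clusters=2, n_init=1, random_state=0))
pipe.fit(docs)

# make_pipeline names each step after its lowercased class name
vocab = pipe.named_steps['countvectorizer'].vocabulary_
svd = pipe.named_steps['truncatedsvd']
print(len(vocab))             # size of the learned vocabulary
print(svd.components_.shape)  # (n_components, vocabulary size)
```

Each call to pipe.predict runs the new text through all three fitted steps in order, which is the same "go down the stack" logic as the manual version, just encapsulated.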
(Showing TruncatedSVD because RandomizedPCA will stop working on text frequency matrices in an upcoming release; it actually performed an SVD, not full PCA, anyway.)
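For readers on current scikit-learn versions: RandomizedPCA has since been removed entirely. A sketch of the two replacements, assuming scikit-learn >= 0.18: TruncatedSVD accepts the sparse term-count matrix directly, while PCA(svd_solver='randomized') is the drop-in equivalent of RandomizedPCA but requires dense input.

```python
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "fix grammatical or spelling errors",
    "clarify meaning without changing it",
    "correct minor mistakes",
]
X = CountVectorizer().fit_transform(docs)  # sparse CSR matrix

# TruncatedSVD works on sparse input directly (no densifying needed)
svd = TruncatedSVD(n_components=2)
X_svd = svd.fit_transform(X)
print(X_svd.shape)  # (3, 2)

# PCA(svd_solver='randomized') replaces RandomizedPCA, but needs dense input
pca = PCA(n_components=2, svd_solver='randomized')
X_pca = pca.fit_transform(X.toarray())
print(X_pca.shape)  # (3, 2)
```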