KMeans in pipeline with GridSearchCV scikit-learn
I want to perform clustering on my text data. To find the best text preprocessing parameters, I built a pipeline and put it in GridSearchCV:
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect1', CountVectorizer(analyzer="word")),
                     ('myfun', MyLemmanization(lemmatize=True,
                                               leave_other_words=True)),
                     ('vect2', CountVectorizer(analyzer="word",
                                               max_df=0.95, min_df=2,
                                               max_features=2000)),
                     ('tfidf', TfidfTransformer()),
                     ('clust', KMeans(n_clusters=10, init='k-means++',
                                      max_iter=100, n_init=1, verbose=1))])
parameters = {'myfun__lemmatize': (True, False),
              'myfun__leave_other_words': (True, False)}
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=1, scoring=score)
gs_clf = gs_clf.fit(text_data)
where score is:
score = make_scorer(my_f1, greater_is_better=True)
and my_f1 has the form:
def my_f1(labels_true, labels_pred):
    # fancy stuff goes here
    ...
and is specially designed for clustering.
So my question is: how do I make this work? How can I pass labels_pred when, by the nature of KMeans, I can only call
gs_clf.fit(data)
while in classification it is possible to call:
gs_clf.fit(data, labels_true)
I know I can write my own custom class, like I did with MyLemmanization:
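from sklearn.base import BaseEstimator, TransformerMixin


class MyLemmanization(BaseEstimator, TransformerMixin):
    def __init__(self, lemmatize=True, leave_other_words=True):
        # some code here
        # scikit-learn expects constructor arguments to be stored under the
        # same names so that GridSearchCV can get and set them
        self.lemmatize = lemmatize
        self.leave_other_words = leave_other_words

    def do_something_to(self, X):
        # some code here
        return articles

    def transform(self, X, y=None):
        return self.do_something_to(X)  # where the actual feature extraction happens

    def fit(self, X, y=None):
        return self  # generally does nothing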
But what has to be done, and how, to KMeans or another clustering algorithm?
You can create a custom K-means in which you use the labeled data to build the initial centroids and then let K-means do its magic.
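One way to read the suggestion above is to average the samples of each known label and hand those averages to KMeans as its starting centroids. The following is only a minimal sketch under that reading: label_seeded_kmeans is a hypothetical helper name, the toy array stands in for the real TF-IDF matrix, and the code assumes dense input (a sparse matrix would need converting or a different way of computing the means).

```python
import numpy as np
from sklearn.cluster import KMeans


def label_seeded_kmeans(X, y, random_state=0):
    """Fit KMeans starting from one centroid per known label (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # Initial centroid of each cluster = mean of the samples carrying that label.
    init = np.vstack([X[y == c].mean(axis=0) for c in classes])
    # n_init=1 because the starting centroids are fixed rather than random.
    km = KMeans(n_clusters=len(classes), init=init, n_init=1,
                random_state=random_state)
    return km.fit(X), classes


# Toy dense data standing in for the TF-IDF matrix (an assumption for illustration).
X_demo = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]])
y_demo = np.array(["a", "a", "b", "b"])

km, classes = label_seeded_kmeans(X_demo, y_demo)
# Cluster i was started from the centroid of classes[i], so on this
# well-separated toy data the cluster ids map back to label names.
predicted = classes[km.labels_]
```

On real data the clusters can drift away from the labels they were seeded with, so the mapping from cluster id to label name is only a starting assumption, not a guarantee.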
You might also want to try k-NN, even though it's a different method.
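For comparison, a k-NN text classifier could look roughly like the sketch below. Unlike KMeans it is supervised, so the labels go straight into fit. The corpus, the topic labels, and the step names are made up for illustration, and TfidfVectorizer is used here as a shortcut for the CountVectorizer plus TfidfTransformer pair from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Tiny made-up corpus with made-up topic labels (illustration only).
docs = ["cats purr and meow", "dogs bark and fetch",
        "kittens meow softly", "puppies bark loudly"]
topics = ["cat", "dog", "cat", "dog"]

knn_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("knn", KNeighborsClassifier(n_neighbors=1)),
])
# A classifier takes the known labels directly as y.
knn_clf.fit(docs, topics)

# The query shares words only with the "cat" documents.
prediction = knn_clf.predict(["the kittens meow"])
```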
More importantly, you have a conceptual problem. You say one of the reasons you use clustering is that it might find previously unknown topics, but you also say you want to evaluate performance by comparing with known labels. You can't really have both, though...
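If you do settle on evaluating against known labels, note that GridSearchCV.fit accepts a y argument even when the last pipeline step is KMeans: KMeans.fit ignores y, and the scorer receives it as labels_true together with the output of predict. The sketch below assumes that setup. The toy blobs, the StandardScaler step (a stand-in for the text preprocessing), and the choice of adjusted_rand_score in place of my_f1 are all assumptions for illustration. Whatever metric you use should not depend on how the cluster ids happen to be numbered.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: two well-separated blobs with known labels (illustration only).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
labels_true = np.array([0] * 20 + [1] * 20)

pipe = Pipeline([
    ("scale", StandardScaler()),  # stand-in for the text preprocessing steps
    ("clust", KMeans(n_init=10, random_state=0)),
])

# adjusted_rand_score(labels_true, labels_pred) is invariant to cluster numbering.
score = make_scorer(adjusted_rand_score)

# Shuffled, stratified folds so that every fold contains both labels.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
gs = GridSearchCV(pipe, {"clust__n_clusters": [2, 3, 4]}, scoring=score, cv=cv)

# y is passed along: KMeans.fit ignores it, the scorer uses it as labels_true.
gs.fit(X, labels_true)
best_k = gs.best_params_["clust__n_clusters"]
```

The same pattern should carry over to tuning preprocessing parameters such as myfun__lemmatize, since GridSearchCV treats every step's parameters alike.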