KMeans in pipeline with GridSearchCV scikit-learn

I want to perform clustering on my text data. To find the best text preprocessing parameters, I built a pipeline and put it in GridSearchCV:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect1', CountVectorizer(analyzer="word")),
                     ('myfun', MyLemmanization(lemmatize=True,
                                               leave_other_words=True)),
                     ('vect2', CountVectorizer(analyzer="word",
                                               max_df=0.95, min_df=2,
                                               max_features=2000)),
                     ('tfidf', TfidfTransformer()),
                     ('clust', KMeans(n_clusters=10, init='k-means++',
                                      max_iter=100, n_init=1, verbose=1))])
parameters = {'myfun__lemmatize': (True, False),
              'myfun__leave_other_words': (True, False)}
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=1, scoring=score)
gs_clf = gs_clf.fit(text_data)

where score is:

score = make_scorer(my_f1, greater_is_better=True)

and my_f1 has the form:

def my_f1(labels_true, labels_pred):
    # fancy stuff goes here

and is designed specifically for clustering.

So my question is: how do I make this work? How do I pass labels_pred when, by the nature of k-means, I can only call

gs_clf.fit(data)

while in classification it is possible to call:

gs_clf.fit(data, labels_true)

I know I can write my own custom class, like I did with MyLemmanization:

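from sklearn.base import BaseEstimator, TransformerMixin

class MyLemmanization(BaseEstimator, TransformerMixin):

    def __init__(self, lemmatize=True, leave_other_words=True):
        # store the arguments unchanged so GridSearchCV can get and set them
        self.lemmatize = lemmatize
        self.leave_other_words = leave_other_words
        # some code here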
    
    def do_something_to(self, X):
        # some code here
        return articles

    def transform(self, X, y=None):
        return self.do_something_to(X)  # where the actual feature extraction happens

    def fit(self, X, y=None):
        return self  # generally does nothing

But what has to be done to KMeans, or to any other clustering algorithm, and how?

You can create a custom K-means in which you use the labeled data to build the initial centroids and then let K-means do its magic.
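One way to read the suggestion above is to compute one centroid per known class and pass that array as `init` to scikit-learn's `KMeans`, rather than subclassing it. The following is only a minimal sketch under that assumption: the toy data from `make_blobs` stands in for a dense feature matrix, and the choice of "every 10th sample is labeled" is purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data standing in for the real features (assumption: dense array).
X, y = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# Illustrative assumption: only every 10th sample has a known label.
labeled_idx = np.arange(0, len(X), 10)
X_labeled, y_labeled = X[labeled_idx], y[labeled_idx]

# One initial centroid per known class: the mean of its labeled samples.
classes = np.unique(y_labeled)
init_centroids = np.vstack(
    [X_labeled[y_labeled == c].mean(axis=0) for c in classes]
)

# An array passed as `init` makes KMeans start from exactly these centroids;
# n_init=1 because there is only a single initialisation to try.
km = KMeans(n_clusters=len(classes), init=init_centroids, n_init=1, max_iter=100)
labels_pred = km.fit_predict(X)
```

Because cluster `i` starts from the centroid of class `classes[i]`, the predicted cluster indices tend to line up with the known class order, which makes a later comparison against `labels_true` easier to interpret.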

You might also want to try k-NN, even though it is a different method.
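If labels are available for all training documents, the k-NN route is plain supervised classification, so `fit(data, labels_true)` works in GridSearchCV without any special handling. A minimal sketch, assuming a made-up toy corpus and scikit-learn's built-in `"f1_macro"` scorer in place of the custom `my_f1`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Made-up toy corpus with two topics (illustration only).
docs = [
    "the cat sat on the mat", "a cat and a dog play", "my dog chased the cat",
    "the dog sleeps on the mat", "stocks fell on the market today",
    "the market rallied as stocks rose", "investors sold stocks today",
    "the market closed higher today",
]
labels = ["pets", "pets", "pets", "pets",
          "finance", "finance", "finance", "finance"]

knn_clf = Pipeline([("tfidf", TfidfVectorizer()),
                    ("knn", KNeighborsClassifier())])

# Search over the number of neighbours; both values fit in a 4-document fold.
param_grid = {"knn__n_neighbors": [1, 3]}
gs_knn = GridSearchCV(knn_clf, param_grid, scoring="f1_macro", cv=2)
gs_knn.fit(docs, labels)

predicted = gs_knn.predict(["the cat and the dog"])
```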

More importantly, you have a conceptual problem. You say one of the reasons you use clustering is that it might find previously unknown topics, but you also say you want to evaluate performance by comparing with known labels. You can't really have both, though...
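The two goals correspond to two different ways of scoring a clustering pipeline in GridSearchCV, sketched below on toy data. With known labels, `fit(X, labels_true)` works even though KMeans ignores `y` during fitting: a scorer built with `make_scorer` calls `predict` on the held-out fold and compares the result with the true labels. Without labels, a callable with the signature `scorer(estimator, X, y=None)` can compute a label-free measure instead. Here `adjusted_rand_score` and `silhouette_score` are stand-ins for the custom `my_f1`, and `StandardScaler` stands in for the text preprocessing steps; both are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, make_scorer, silhouette_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, labels_true = make_blobs(n_samples=300, centers=3, cluster_std=0.5,
                            random_state=0)

pipe = Pipeline([("scale", StandardScaler()),
                 ("clust", KMeans(n_init=10, random_state=0))])
param_grid = {"clust__n_clusters": [2, 3, 4]}

# Option 1: known labels. The scorer calls pipe.predict(X_test) and passes
# (y_test, predictions) to the metric; y is ignored while KMeans is fitted.
label_scorer = make_scorer(adjusted_rand_score)
gs_labeled = GridSearchCV(pipe, param_grid, scoring=label_scorer, cv=3)
gs_labeled.fit(X, labels_true)

# Option 2: no labels. A callable scorer receives the fitted pipeline and
# the held-out data, so it can use an internal measure such as silhouette.
def silhouette_scorer(estimator, X, y=None):
    features = estimator[:-1].transform(X)  # data as the clusterer sees it
    return silhouette_score(features, estimator.predict(X))

gs_unlabeled = GridSearchCV(pipe, param_grid, scoring=silhouette_scorer, cv=3)
gs_unlabeled.fit(X)
```

The metric passed to `make_scorer` should be invariant to how cluster indices are numbered, as `adjusted_rand_score` is; a plain classification F1 would penalise a correct clustering whose cluster ids happen to be permuted.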
