
KMeans in pipeline with GridSearchCV scikit-learn

I want to perform clustering on my text data. To find the best text-preprocessing parameters, I built a pipeline and put it in GridSearchCV:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect1', CountVectorizer(analyzer="word")),
                     ('myfun', MyLemmanization(lemmatize=True,
                                               leave_other_words=True)),
                     ('vect2', CountVectorizer(analyzer="word",
                                               max_df=0.95, min_df=2,
                                               max_features=2000)),
                     ('tfidf', TfidfTransformer()),
                     ('clust', KMeans(n_clusters=10, init='k-means++',
                                      max_iter=100, n_init=1, verbose=1))])
parameters = {'myfun__lemmatize': (True, False),
              'myfun__leave_other_words': (True, False)}
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=1, scoring=score)
gs_clf = gs_clf.fit(text_data)

where score is:

score = make_scorer(my_f1, greater_is_better=True)

and my_f1 is of the form:

def my_f1(labels_true, labels_pred):
    # fancy stuff goes here

and is designed specifically for clustering.
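(Illustration only, since my_f1 isn't shown: a clustering score must not depend on the arbitrary numbering of the clusters. scikit-learn's adjusted_rand_score is one such permutation-invariant metric:)

```python
from sklearn.metrics import adjusted_rand_score

# cluster ids are arbitrary: [0, 0, 1, 1] and [1, 1, 0, 0]
# describe the same partition, so the score is a perfect 1.0
print(adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0]))  # -> 1.0
```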

So my question is: how can I make that work? How do I pass labels_pred when, by the nature of k-means, I can only do

gs_clf.fit(data)

while in classification it is possible to do:

gs_clf.fit(data, labels_true)
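(A sketch, not the asker's code: GridSearchCV actually accepts y even when the final step is a clusterer. KMeans.fit ignores y, but the scorer still receives it as labels_true on each test fold. Here adjusted_rand_score and synthetic blob data stand in for my_f1 and the text data.)

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, labels_true = make_blobs(n_samples=200, centers=3,
                            cluster_std=0.5, random_state=0)

pipe = Pipeline([('clust', KMeans(n_init=10, random_state=0))])
score = make_scorer(adjusted_rand_score)  # wraps a metric just like my_f1

gs = GridSearchCV(pipe, {'clust__n_clusters': [2, 3, 4]},
                  scoring=score, cv=3)
gs.fit(X, labels_true)   # y is ignored by KMeans but passed to the scorer
print(gs.best_params_)   # -> {'clust__n_clusters': 3}
```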

I know I can write a custom class, like I did with MyLemmanization:

from sklearn.base import BaseEstimator, TransformerMixin

class MyLemmanization(BaseEstimator, TransformerMixin):

    def __init__(self, lemmatize=True, leave_other_words=True):
        # store parameters under their own names so GridSearchCV can
        # clone the estimator and set them via 'myfun__...' keys
        self.lemmatize = lemmatize
        self.leave_other_words = leave_other_words

    def do_something_to(self, X):
        # some code here
        return articles

    def transform(self, X, y=None):
        return self.do_something_to(X)  # the actual feature extraction happens here

    def fit(self, X, y=None):
        return self  # generally does nothing

But how and what has to be done to KMeans or other clustering algorithm?

You can create a custom K-means where you use the labeled data to build the initial centroids and then let K-means do its magic.
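(A minimal sketch of that suggestion, on synthetic data: compute one centroid per known class from the labeled examples, then hand them to KMeans as the initial centers and let it refine them.)

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=4, random_state=1)

# one initial centroid per known class, built from the labeled data
init_centroids = np.vstack([X[y == c].mean(axis=0) for c in np.unique(y)])

# n_init=1 because we supply explicit starting centers
km = KMeans(n_clusters=len(init_centroids), init=init_centroids, n_init=1)
labels = km.fit_predict(X)
```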

You might also want to try k-NN, even though it's a different method.

More importantly, you have a conceptual problem. You say one of the reasons you use clustering is that it might find previously unknown topics, but you also say you want to evaluate performance by comparing against known labels. You can't really have both: a score built from known labels can only reward rediscovering those labels, not finding new topics.
