I want to cluster my text data. To find the best text preprocessing parameters, I built a pipeline and passed it to GridSearchCV:
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    text_clf = Pipeline([('vect1', CountVectorizer(analyzer="word")),
                         ('myfun', MyLemmanization(lemmatize=True,
                                                   leave_other_words=True)),
                         ('vect2', CountVectorizer(analyzer="word",
                                                   max_df=0.95, min_df=2,
                                                   max_features=2000)),
                         ('tfidf', TfidfTransformer()),
                         ('clust', KMeans(n_clusters=10, init='k-means++',
                                          max_iter=100, n_init=1, verbose=1))])
    # grid over the two preprocessing switches of the custom step
    parameters = {'myfun__lemmatize': (True, False),
                  'myfun__leave_other_words': (True, False)}
    gs_clf = GridSearchCV(text_clf, parameters, n_jobs=1, scoring=score)
    gs_clf = gs_clf.fit(text_data)
where score is defined as
    score = make_scorer(my_f1, greater_is_better=True)
and my_f1 has the form
    def my_f1(labels_true, labels_pred):
        # fancy stuff goes here
        ...
and is designed specifically for clustering.
So my question is: how do I make this work? How do I get labels_pred passed to the scorer when, by the nature of k-means, I can only call
    gs_clf.fit(data)
while in classification it is possible to call
    gs_clf.fit(data, labels_true)
I know I can write my own custom class, as I did with MyLemmanization:
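    from sklearn.base import BaseEstimator, TransformerMixin

    class MyLemmanization(BaseEstimator, TransformerMixin):
        def __init__(self, lemmatize=True, leave_other_words=True):
            # store parameters under their own names so GridSearchCV can set them
            self.lemmatize = lemmatize
            self.leave_other_words = leave_other_words
            # some code here

        def do_something_to(self, X):
            # some code here
            return articles

        def transform(self, X, y=None):
            return self.do_something_to(X)  # where the actual feature extraction happens

        def fit(self, X, y=None):
            return self  # generally does nothing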
But what has to be done, and how, for KMeans or another clustering algorithm?
You can create a custom K-means where you use the labeled data to build the initial centroids and then let K-means do its magic.
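As a rough illustration of the seeded-centroid idea, the sketch below averages the labeled points of each class and hands the result to KMeans as its init array. The helper name centroids_from_labels and the toy data are made up for this example, and the inputs are assumed to be dense NumPy arrays (a sparse TF-IDF matrix would need converting or a different averaging step).

```python
import numpy as np
from sklearn.cluster import KMeans


def centroids_from_labels(X, y):
    """Return one centroid per class: the mean of that class's rows in X.

    Hypothetical helper for this sketch; assumes X is a dense 2-D array.
    """
    classes = np.unique(y)
    return np.vstack([X[y == c].mean(axis=0) for c in classes])


# Toy data, invented for illustration: a few labeled points per class
# plus some unlabeled points that should fall near one class or the other.
X_labeled = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
y_labeled = np.array([0, 0, 1, 1])
X_unlabeled = np.array([[0.1, 0.2], [4.8, 5.0], [0.3, 0.1], [5.1, 5.3]])

init = centroids_from_labels(X_labeled, y_labeled)

# With an explicit init array, n_clusters must equal its number of rows,
# and a single initialization (n_init=1) is all that makes sense.
km = KMeans(n_clusters=init.shape[0], init=init, n_init=1, max_iter=100)
km.fit(np.vstack([X_labeled, X_unlabeled]))

cluster_ids = km.predict(X_unlabeled)
print(cluster_ids)
```

Because the initial centroids are ordered by class, cluster index i starts out at the centroid of class i, which makes the resulting clusters easier to compare with the known labels.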
You might also want to try k-NN (k-nearest neighbors), even though it is a different, supervised method.
More importantly, you have a conceptual problem. You say one of the reasons you use clustering is that it might find previously unknown topics, but you also say you want to evaluate performance by comparing with known labels. You can't really have both, though.
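If you do settle on evaluating against known labels, one possible way to wire it up is to pass them as y to GridSearchCV.fit. KMeans.fit accepts a y argument and ignores it, and a scorer built with make_scorer calls predict on the held-out fold and compares the result with the true labels. The sketch below assumes this setup; adjusted_rand_score stands in for my_f1, TfidfVectorizer stands in for the preprocessing steps, and the toy corpus is invented.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus, invented for illustration; labels alternate so that each
# of the two unshuffled folds contains both classes.
docs = [
    "cats purr and chase mice",
    "stocks fell as markets closed",
    "kittens and cats sleep all day",
    "investors sold stocks and bonds",
    "mice hide when cats chase them",
    "markets rallied and bonds rose",
    "cats and kittens chase mice",
    "stocks and bonds moved markets",
]
labels_true = [0, 1, 0, 1, 0, 1, 0, 1]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clust", KMeans(n_clusters=2, n_init=10, random_state=0)),
])

# Stand-in for the preprocessing switches in the question.
param_grid = {"tfidf__ngram_range": [(1, 1), (1, 2)]}

# adjusted_rand_score is invariant to how cluster ids are numbered,
# which matters because k-means ids are arbitrary.
score = make_scorer(adjusted_rand_score, greater_is_better=True)

gs = GridSearchCV(pipe, param_grid, scoring=score, cv=2, n_jobs=1)
gs.fit(docs, labels_true)  # y is ignored by KMeans.fit, used only for scoring

print(gs.best_params_, gs.best_score_)
```

Whatever metric replaces adjusted_rand_score here should likewise not depend on the numbering of the clusters, since a cluster id carries no meaning beyond grouping.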