from sklearn.cluster.bicluster import SpectralCoclustering
from sklearn.feature_extraction.text import TfidfVectorizer


def number_normalizer(tokens):
    """Map all numeric tokens to a placeholder.

    For many applications, tokens that begin with a number are not directly
    useful, but the fact that such a token exists can be relevant. By applying
    this form of dimensionality reduction, some methods may perform better.
    """
    return ("#NUMBER" if token[0].isdigit() else token for token in tokens)


class NumberNormalizingVectorizer(TfidfVectorizer):
    def build_tokenizer(self):
        tokenize = super(NumberNormalizingVectorizer, self).build_tokenizer()
        return lambda doc: list(number_normalizer(tokenize(doc)))


vectorizer = NumberNormalizingVectorizer(stop_words='english', min_df=5)
cocluster = SpectralCoclustering(n_clusters=5, svd_method='arpack', random_state=0)
X = vectorizer.fit_transform(data)
cocluster.fit(X)

I chose SpectralCoclustering to cluster about 30k tweets. Everything went well until I fit the data X with "cocluster".
It raises the error shown below.
.env/lib/python3.5/site-packages/sklearn/utils/validation.py", line 43, in _assert_all_finite
" or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
When I ran the check from the line where the error is raised (linked below) on X myself, it returned False. It should be True when the error occurs, right?
So is there anything else I can check to find the bug? Thanks!
https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/validation.py#L43
X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum()) and not np.isfinite(X).all()
False

When I encountered a similar problem with another sklearn module today, the following troubleshooting helped me:
1. Check for inf / NaN values in your data. If there are any, replace them with a constant. For example, replace inf by LARGE_NUMBER.
2. Make LARGE_NUMBER smaller. The difference between 10^100 and 10^10 might not be important if your actual data range between -100 and 100.
3. If the error persists, reduce it further. (In my case I had to make LARGE_NUMBER yet smaller, and then the error was gone.)
I suppose that this behaviour came (in my case) from the fact that the sklearn module sometimes uses the exponential function. Your method might therefore call another function where one of the parameters is inf, although your input did not contain such values.
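
The checks described above can be sketched roughly as follows. This is a minimal illustration, not the code from the answer: the helper names and the value of LARGE_NUMBER are hypothetical, and it assumes NumPy 1.17 or later (for the keyword arguments of np.nan_to_num) and SciPy for sparse input such as the output of TfidfVectorizer. The last helper tests one possible way an inf could arise inside the estimator even though the input is finite: a row or column that sums to zero could lead to a division by zero during normalization. That is an assumption to verify, not a confirmed diagnosis of the question's error.

```python
import numpy as np
from scipy import sparse

LARGE_NUMBER = 1e10  # hypothetical cap; pick something close to your real data range


def count_non_finite(X):
    """Return how many stored entries of X are NaN or +/-inf."""
    values = X.tocsr().data if sparse.issparse(X) else np.asarray(X, dtype=float)
    return int((~np.isfinite(values)).sum())


def cap_non_finite(X, large=LARGE_NUMBER):
    """Return a copy of X with NaN -> 0, +/-inf -> +/-large, all values clipped to [-large, large]."""
    if sparse.issparse(X):
        X = X.tocsr().copy()
        cleaned = np.nan_to_num(X.data, nan=0.0, posinf=large, neginf=-large)
        X.data = np.clip(cleaned, -large, large)
        return X
    cleaned = np.nan_to_num(np.array(X, dtype=float), nan=0.0, posinf=large, neginf=-large)
    return np.clip(cleaned, -large, large)


def count_empty_rows_and_columns(X):
    """Return (n_rows, n_cols) whose entries sum to exactly zero."""
    row_sums = np.asarray(X.sum(axis=1)).ravel()
    col_sums = np.asarray(X.sum(axis=0)).ravel()
    return int((row_sums == 0).sum()), int((col_sums == 0).sum())
```

Applied to the X from the question, count_non_finite(X) would confirm whether the input itself is clean, and count_empty_rows_and_columns(X) would show whether any tweets were left with no tokens after stop-word removal and min_df=5. If such rows exist, dropping them before calling cocluster.fit is one thing to try.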