
ValueError: Input contains NaN, infinity or a value too large for dtype('float64') with scikit-learn

# In modern scikit-learn (>= 0.23), SpectralCoclustering lives directly under sklearn.cluster;
# the old sklearn.cluster.bicluster path has been removed.
from sklearn.cluster import SpectralCoclustering
from sklearn.feature_extraction.text import TfidfVectorizer
def number_normalizer(tokens):
    """ Map all numeric tokens to a placeholder.
    For many applications, tokens that begin with a number are not directly
    useful, but the fact that such a token exists can be relevant.  By applying
    this form of dimensionality reduction, some methods may perform better.
    """
    return ("#NUMBER" if token[0].isdigit() else token for token in tokens)


class NumberNormalizingVectorizer(TfidfVectorizer):

    def build_tokenizer(self):
        tokenize = super(NumberNormalizingVectorizer, self).build_tokenizer()
        return lambda doc: list(number_normalizer(tokenize(doc)))

vectorizer = NumberNormalizingVectorizer(stop_words='english', min_df=5)
cocluster = SpectralCoclustering(n_clusters=5, svd_method='arpack', random_state=0)
X = vectorizer.fit_transform(data)  # data: a list of ~30k tweet strings

cocluster.fit(X)

I chose SpectralCoclustering to cluster about 30k tweets. Everything went well until I fit the data X into "cocluster", at which point it raised the error shown below.

.env/lib/python3.5/site-packages/sklearn/utils/validation.py", line 43, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

When I ran the check from the line the traceback points to (see below), it returned "False". It should be True when the error occurs, right?

So is there anything more I can do to find the bug? Thanks!

https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/validation.py#L43

X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum()) and not np.isfinite(X).all()

False
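One possible reason for the mismatch: TfidfVectorizer returns a scipy sparse matrix, and for sparse input scikit-learn runs the finiteness check on the matrix's .data array (the stored values), not on the matrix object you tested. A minimal sketch of inspecting the stored values directly — the small matrix here is a hypothetical stand-in for X, not data from the original post:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical stand-in for the tf-idf output; one entry is NaN.
X = csr_matrix(np.array([[0.5, 0.0],
                         [np.nan, 1.0]]))

# For sparse matrices, check the stored values, not the matrix object:
print(np.isfinite(X.data).all())  # False -> X contains NaN or inf
print(np.isnan(X.data).any())     # True  -> specifically NaN
```

If this check comes back clean on your X, the non-finite values are likely being produced later, inside the estimator itself.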

When I encountered a similar problem with another sklearn module today, the following troubleshooting steps helped me:

  • Try to reproduce the error with uninteresting input data. Replace your 30k big numbers by 30 integers between 0 and 10.
    (In my case, I was not able to reproduce the error that way.)
  • Check for inf / NaN values in your data. If there are any, replace them with a constant. For example, replace inf by LARGE_NUMBER .
    (In my case, the error still did not go away.)
  • Make your constant LARGE_NUMBER smaller. The difference between 10^100 and 10^10 may not matter much if your actual data range between -100 and 100.
    (In my case, the error message changed — a first success. So I made my LARGE_NUMBER smaller still, and then the error was gone.)
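The second and third steps above can be sketched as follows. LARGE_NUMBER and the sample array are illustrative, not from the original post; the nan/posinf/neginf keywords of np.nan_to_num require NumPy >= 1.17:

```python
import numpy as np

LARGE_NUMBER = 1e10  # shrink this if the error persists or merely changes

data = np.array([1.0, np.inf, -np.inf, np.nan, 42.0])

# Replace NaN with 0 and +/-inf with a bounded constant:
cleaned = np.nan_to_num(data, nan=0.0,
                        posinf=LARGE_NUMBER, neginf=-LARGE_NUMBER)

print(np.isfinite(cleaned).all())  # True -> safe to pass to sklearn
```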

I suppose that this behaviour came (in my case) from the fact that the sklearn module sometimes uses the exponential function internally. Your method might therefore call another function where one of the parameters is inf, even though your input did not contain such values.
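As an illustration of that point, np.exp overflows float64 for moderately large finite inputs, so perfectly finite data can still turn into inf inside an estimator:

```python
import numpy as np

x = np.array([10.0, 100.0, 1000.0])  # all finite

with np.errstate(over='ignore'):  # silence the overflow warning
    y = np.exp(x)

print(np.isfinite(y))  # [ True  True False] -> exp(1000) overflows to inf
```

This is why shrinking LARGE_NUMBER helped: it kept intermediate results inside float64's representable range (roughly up to 1.8e308, i.e. exp of about 709).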
