ValueError: Input contains NaN, infinity or a value too large for dtype('float64') in scikit-learn

Question

from sklearn.cluster import SpectralCoclustering
from sklearn.feature_extraction.text import TfidfVectorizer
def number_normalizer(tokens):
    """ Map all numeric tokens to a placeholder.
    For many applications, tokens that begin with a number are not directly
    useful, but the fact that such a token exists can be relevant.  By applying
    this form of dimensionality reduction, some methods may perform better.
    """
    return ("#NUMBER" if token[0].isdigit() else token for token in tokens)


class NumberNormalizingVectorizer(TfidfVectorizer):

    def build_tokenizer(self):
        tokenize = super(NumberNormalizingVectorizer, self).build_tokenizer()
        return lambda doc: list(number_normalizer(tokenize(doc)))

vectorizer = NumberNormalizingVectorizer(stop_words='english', min_df=5)
cocluster = SpectralCoclustering(n_clusters=5, svd_method='arpack', random_state=0)
X = vectorizer.fit_transform(data)

cocluster.fit(X)

I chose SpectralCoclustering to cluster about 30k tweets. Everything went well until I fit the data X into "cocluster".

It raises the error shown below.

.env/lib/python3.5/site-packages/sklearn/utils/validation.py", line 43, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I ran the check from the line that raises the error (linked below, with my output), but it returned False. It should be True when the error occurs, right?

So is there any other way to track down the bug? Thanks!

https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/validation.py#L43

X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum()) and not np.isfinite(X).all()

False

Answer 1

When I encountered a similar problem with another sklearn module today, the following troubleshooting steps helped me:

Try to reproduce the error with uninteresting input data. Replace your 30k large numbers with 30 integers between 0 and 10.
(In my case, I was not able to reproduce the error that way.)
Check for inf / NaN values in your data. If there are any, replace them with a constant. For example, replace inf with LARGE_NUMBER.
(In my case, the error still did not go away.)
Make your constant LARGE_NUMBER smaller. The difference between 10^100 and 10^10 may not matter much if your actual data range between -100 and 100.
(In my case, the error message changed (first success). So I made LARGE_NUMBER smaller still, and then the error was gone.)
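The check-and-replace steps above can be sketched roughly as follows. This is a minimal illustration and not the code I actually ran: the helper names count_non_finite and replace_non_finite, the value of LARGE_NUMBER, and the choice to map NaN to 0.0 are all assumptions made for the example. For a SciPy sparse matrix such as a TF-IDF output, the same helpers could be applied to the matrix's .data array.

```python
import numpy as np

# Hypothetical cap; pick something comfortably above your real data range.
LARGE_NUMBER = 1e10


def count_non_finite(values):
    """Return how many entries are NaN, +inf or -inf."""
    values = np.asarray(values, dtype=np.float64)
    return int(np.count_nonzero(~np.isfinite(values)))


def replace_non_finite(values, large_number=LARGE_NUMBER, nan_value=0.0):
    """Replace NaN with nan_value and cap everything at +/- large_number."""
    values = np.asarray(values, dtype=np.float64)
    # NaN has no sensible sign, so it is mapped to a fixed constant first.
    cleaned = np.where(np.isnan(values), nan_value, values)
    # Clipping turns +/-inf (and any huge finite value) into +/-large_number.
    return np.clip(cleaned, -large_number, large_number)


data = np.array([1.0, np.inf, -np.inf, np.nan, 42.0])
print("non-finite entries before:", count_non_finite(data))
cleaned = replace_non_finite(data)
print("non-finite entries after:", count_non_finite(cleaned))
```

If the error persists after this, lowering LARGE_NUMBER and rerunning is the next thing to try, as described in the last step.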

I suppose this behaviour came (in my case) from the fact that the sklearn module sometimes uses the exponential function. Your method might therefore call another function with inf as one of its parameters, even though your input contained no such values.
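As a small illustration of that last point (plain NumPy, not taken from sklearn's internals), finite inputs can produce a non-finite intermediate value once an exponential is applied. The input values here are made up for the example.

```python
import numpy as np

# All inputs are ordinary finite float64 values.
inputs = np.array([1.0, 100.0, 1000.0])
print("inputs finite:", np.isfinite(inputs).all())

# exp(1000) exceeds the float64 range and overflows to inf.
# errstate only silences the overflow warning; the result is unchanged.
with np.errstate(over="ignore"):
    derived = np.exp(inputs)

print("derived finite:", np.isfinite(derived).all())
print("overflowed entries:", int(np.count_nonzero(np.isinf(derived))))
```

This is why a finiteness check on the input alone can pass while a later validation step inside the estimator still raises the ValueError.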
