ValueError: Input contains NaN, infinity or a value too large for dtype('float64') with scikit-learn
# SpectralCoclustering is importable from sklearn.cluster; the older
# sklearn.cluster.bicluster path no longer exists in recent releases.
from sklearn.cluster import SpectralCoclustering
from sklearn.feature_extraction.text import TfidfVectorizer


def number_normalizer(tokens):
    """ Map all numeric tokens to a placeholder.

    For many applications, tokens that begin with a number are not directly
    useful, but the fact that such a token exists can be relevant. By applying
    this form of dimensionality reduction, some methods may perform better.
    """
    return ("#NUMBER" if token[0].isdigit() else token for token in tokens)


class NumberNormalizingVectorizer(TfidfVectorizer):
    def build_tokenizer(self):
        tokenize = super(NumberNormalizingVectorizer, self).build_tokenizer()
        return lambda doc: list(number_normalizer(tokenize(doc)))


# `data` holds the tweet texts (not shown in the question).
vectorizer = NumberNormalizingVectorizer(stop_words='english', min_df=5)
cocluster = SpectralCoclustering(n_clusters=5, svd_method='arpack', random_state=0)
X = vectorizer.fit_transform(data)
cocluster.fit(X)
I chose SpectralCoclustering to cluster about 30k tweets. Everything went well until I fit the data X into "cocluster", which raised the error shown below.
.env/lib/python3.5/site-packages/sklearn/utils/validation.py", line 43, in _assert_all_finite
" or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
When I ran the check that the error message points to, it returned "False". It should be True when the error occurs, right?
Is there any other way to track down the bug? Thanks!
https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/validation.py#L43
X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum()) and not np.isfinite(X).all()
False
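Since TfidfVectorizer returns a SciPy sparse matrix, one way to double-check the input is to look only at the values the matrix actually stores. The following is a minimal sketch under that assumption; `count_non_finite` and `X_demo` are hypothetical names used for illustration and are not part of scikit-learn.

```python
import numpy as np
from scipy import sparse


def count_non_finite(X):
    """Return the number of stored NaN/inf entries in a dense or sparse matrix."""
    if sparse.issparse(X):
        # In CSR form the explicitly stored values live in the .data array.
        values = X.tocsr().data
    else:
        values = np.asarray(X)
    return int(np.count_nonzero(~np.isfinite(values)))


# Toy stand-in for the TF-IDF matrix (hypothetical data, not the tweets).
X_demo = sparse.csr_matrix(np.array([[0.0, 1.5], [np.nan, 2.0]]))
print(count_non_finite(X_demo))
```

If this returns 0 for the real X, the input itself is finite, and the non-finite value is probably produced somewhere inside the fit step.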
When I encountered a similar problem with another sklearn module today, the following troubleshooting helped me:
1. Check for inf / NaN values in your data. If there are any, replace them with a constant.
2. Replace inf by LARGE_NUMBER.
3. If the error remains, make LARGE_NUMBER smaller. It might be that the difference between 10^100 and 10^10 is not too important, if your actual data ranges between -100 and 100.
(In my case, I had to make LARGE_NUMBER yet smaller, and then the error was gone.)
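The replacement steps described here could be sketched roughly as follows. The value of `LARGE_NUMBER`, the choice of 0.0 as the NaN replacement, and the helper name `replace_non_finite` are all assumptions for illustration; the answer does not say which constants were used. The keyword arguments of `np.nan_to_num` require NumPy 1.17 or later.

```python
import numpy as np

LARGE_NUMBER = 1e10  # hypothetical cap; make it smaller if the error remains


def replace_non_finite(values, large=LARGE_NUMBER, nan_value=0.0):
    """Return a copy with NaN -> nan_value, +/-inf -> +/-large, and
    every remaining value clipped into [-large, large]."""
    cleaned = np.nan_to_num(values, nan=nan_value, posinf=large, neginf=-large)
    return np.clip(cleaned, -large, large)


raw = np.array([1.0, np.nan, np.inf, -np.inf, 1e100])
print(replace_non_finite(raw))
```

For a CSR sparse matrix, the same function could be applied to the stored values only, for example `X.data = replace_non_finite(X.data)`.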
I suppose that this behaviour came (in my case) from the fact that the sklearn module sometimes uses the exponential function. Therefore your method might call another function where one of the parameters is inf, although your input did not contain such values.
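To illustrate how finite input can still lead to inf in an intermediate result, here is a small NumPy sketch. It is not taken from scikit-learn's code. The second case, dividing by the square root of a zero row sum, is only an assumption about what a normalization step might do with an all-zero row, such as a document that has no terms left after filtering.

```python
import numpy as np

finite_input = np.array([1.0, 100.0, 1000.0])

# exp() overflows float64 for arguments above roughly 709.
with np.errstate(over="ignore"):
    exponentials = np.exp(finite_input)

print(np.isfinite(finite_input).all())
print(np.isfinite(exponentials))

# Hypothetical scaling step: an all-zero row has a row sum of 0,
# so 1 / sqrt(0) evaluates to inf.
row_sums = np.array([2.0, 0.0])
with np.errstate(divide="ignore"):
    scale = 1.0 / np.sqrt(row_sums)

print(np.isfinite(scale))
```

In both cases every input value passes a finiteness check, yet the derived array contains inf.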