Error in fit_transform: Input contains NaN, infinity or a value too large for dtype('float64')
I have a dataframe of shape (14407, 2564). I am trying to remove low-variance features using the VarianceThreshold function. However, when I call fit_transform, I get the following error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Before using VarianceThreshold, I replaced all the missing values in my df using the code below:
df.replace('null',np.NaN, inplace=True)
df.replace(r'^\s*$', np.NaN, regex=True, inplace=True)
df.fillna(value=df.median(), inplace=True)
I checked my dataframe afterwards for any empty/infinite values using:
m = df.isnull().any()
print "========= COLUMNS WITH NULL VALUES ================="
print m[m]
print "========= COLUMNS WITH INFINITE VALUES ================="
m = np.isfinite(df.select_dtypes(include=['float64'])).any()
print m[m]
and I got an empty Series as output, which I took to mean that none of my columns has any missing values. The output was:
========= COLUMNS WITH NULL VALUES =================
Series([], dtype: bool)
========= COLUMNS WITH INFINITE VALUES =================
Series([], dtype: bool)
Full error trace:
Traceback (most recent call last):
File "/home/users/MyUsername/MyProject/src/main/python/Main.py", line 222, in <module>
main()
File "/home/users/MyUsername/MyProject/src/main/python/Main.py", line 218, in main
getAllData()
File "/home/users/MyUsername/MyProject/src/main/python/Main.py", line 95, in getAllData
predictors, labels, dropped_features = fselector.process(variance=True, corr=True, bestf=True, bestfk=200)
File "/home/users/MyUsername/MyProject/src/main/python/classes/featureselector.py", line 54, in process
self.getVariance(threshold=(.95 * (1 - .95)))
File "/home/users/MyUsername/MyProject/src/main/python/classes/featureselector.py", line 136, in getVariance
self.removeLowVarianceColumns(df=self.X, thresh=threshold)
File "/home/users/MyUsername/MyProject/src/main/python/classes/featureselector.py", line 213, in removeLowVarianceColumns
selector.fit_transform(df)
File "/usr/lib64/python2.7/site-packages/sklearn/base.py", line 494, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/usr/lib64/python2.7/site-packages/sklearn/feature_selection/variance_threshold.py", line 64, in fit
X = check_array(X, ('csr', 'csc'), dtype=np.float64)
File "/usr/lib64/python2.7/site-packages/sklearn/utils/validation.py", line 407, in check_array
_assert_all_finite(array)
File "/usr/lib64/python2.7/site-packages/sklearn/utils/validation.py", line 58, in _assert_all_finite
" or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
So I am not sure what to check. This does not seem like a missing-value issue, but I also cannot work out which columns/values are causing the problem.
I've seen several threads here that all come down to a missing value, but that does not seem to be the problem here.
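One way to locate the offending columns is sketched below. It rests on an assumption not confirmed by the question itself: that some columns are stored with a non-numeric dtype (for example object), in which case a check that first filters with select_dtypes(include=['float64']) never looks at them and returns an empty Series. The helper name find_problem_columns and the toy data are hypothetical, for illustration only.

```python
import numpy as np
import pandas as pd


def find_problem_columns(df):
    """Return (non-numeric column names, columns with NaN/inf after coercion)."""
    # Columns whose dtype is not numeric (e.g. object) are invisible to
    # checks that only select float64 columns.
    non_numeric = [c for c in df.columns
                   if not pd.api.types.is_numeric_dtype(df[c])]
    # Coerce every column to numbers; entries that cannot be parsed become NaN.
    coerced = df.apply(lambda col: pd.to_numeric(col, errors='coerce'))
    # ~isfinite flags NaN, +inf and -inf in one pass.
    bad_mask = ~np.isfinite(coerced.to_numpy(dtype='float64'))
    non_finite = [c for c, bad in zip(df.columns, bad_mask.any(axis=0)) if bad]
    return non_numeric, non_finite


# Hypothetical toy data: 'b' holds strings, 'c' holds an infinite value.
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0],
    'b': ['1', '2', 'oops'],
    'c': [1.0, np.inf, 3.0],
})
non_numeric, non_finite = find_problem_columns(df)
print(non_numeric)
print(non_finite)
```

Running the same helper on the real dataframe and printing df.dtypes.value_counts() should show quickly whether the data is numeric at all.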
I solved this by casting my data to numeric. It appears that, although the error message mentions 'float64', my columns were all of dtype object, and object columns did not work well with fit_transform.
Changing my data to float using:
df = df.apply(lambda x: pd.to_numeric(x,errors='ignore'))
solved the issue.
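A stricter variant of the same idea can be sketched as follows. It is not the code from the answer: it swaps errors='ignore' for errors='coerce', because 'ignore' leaves a column untouched if any value in it fails to parse (and recent pandas releases deprecate that option). The helper name to_clean_float and the toy data are hypothetical; the threshold is the one that appears in the traceback.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def to_clean_float(df):
    """Coerce all columns to float64 and fill non-finite values with the median."""
    # Unparseable entries (such as the string 'null') become NaN.
    numeric = df.apply(lambda col: pd.to_numeric(col, errors='coerce'))
    # Treat +/-inf as missing as well, since sklearn rejects them too.
    numeric = numeric.replace([np.inf, -np.inf], np.nan)
    # Fill after coercion, so the medians are computed on numeric data.
    # Note: a column that is entirely NaN has no median and stays NaN.
    numeric = numeric.fillna(numeric.median())
    return numeric.astype('float64')


# Hypothetical toy data stored as strings, i.e. object dtype.
df = pd.DataFrame({
    'x': ['0', '1', '0', '1'],
    'y': ['1', '1', '1', 'null'],
    'z': ['0.5', 'null', '2.5', '4.5'],
}, dtype=object)

clean = to_clean_float(df)
selector = VarianceThreshold(threshold=0.95 * (1 - 0.95))
reduced = selector.fit_transform(clean)
kept = list(clean.columns[selector.get_support()])
print(kept)
```

The order matters here: filling with the median before the columns are numeric cannot work as intended, since a median is only defined for numeric columns.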