
Multiple OLS Regression with Statsmodels ValueError: zero-size array to reduction operation maximum which has no identity

I am having a problem performing multiple regression on a dataset of around 7500 data points with missing data (NaN) in some columns and rows. There is at least one NaN value in each row. Some rows contain only NaN values.

I am using OLS from statsmodels for the regression analysis. I'm trying not to use scikit-learn for the OLS regression because (I might be wrong about this) I'd have to impute the missing data in my dataset, which would distort the dataset to a certain extent.

My dataset looks like this: KPI

This is what I did (the target variable is KPI6, the predictor variables are all the remaining variables):

from statsmodels.formula.api import ols

est2 = ols(formula = KPI.KPI6.name + ' ~ ' + ' + '.join(KPI.drop('KPI6', axis = 1).columns.tolist()), data = KPI).fit()

And it raises a ValueError: zero-size array to reduction operation maximum which has no identity.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-207-b24ba316a452> in <module>()
      3 #test = KPI.dropna(how='all')
      4 #test = KPI.fillna(0)
----> 5 est2 = ols(formula = KPI.KPI6.name + ' ~ ' + ' + '.join(KPI.drop('KPI6', axis = 1).columns.tolist()), data = KPI).fit()
      6 print(est2.summary())

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/model.py in from_formula(cls, formula, data, subset, drop_cols, *args, **kwargs)
    172                        'formula': formula,  # attach formula for unpckling
    173                        'design_info': design_info})
--> 174         mod = cls(endog, exog, *args, **kwargs)
    175         mod.formula = formula
    176 

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/regression/linear_model.py in __init__(self, endog, exog, missing, hasconst, **kwargs)
    629                  **kwargs):
    630         super(OLS, self).__init__(endog, exog, missing=missing,
--> 631                                   hasconst=hasconst, **kwargs)
    632         if "weights" in self._init_keys:
    633             self._init_keys.remove("weights")

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/regression/linear_model.py in __init__(self, endog, exog, weights, missing, hasconst, **kwargs)
    524             weights = weights.squeeze()
    525         super(WLS, self).__init__(endog, exog, missing=missing,
--> 526                                   weights=weights, hasconst=hasconst, **kwargs)
    527         nobs = self.exog.shape[0]
    528         weights = self.weights

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/regression/linear_model.py in __init__(self, endog, exog, **kwargs)
     93     """
     94     def __init__(self, endog, exog, **kwargs):
---> 95         super(RegressionModel, self).__init__(endog, exog, **kwargs)
     96         self._data_attr.extend(['pinv_wexog', 'wendog', 'wexog', 'weights'])
     97 

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/model.py in __init__(self, endog, exog, **kwargs)
    210 
    211     def __init__(self, endog, exog=None, **kwargs):
--> 212         super(LikelihoodModel, self).__init__(endog, exog, **kwargs)
    213         self.initialize()
    214 

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/model.py in __init__(self, endog, exog, **kwargs)
     61         hasconst = kwargs.pop('hasconst', None)
     62         self.data = self._handle_data(endog, exog, missing, hasconst,
---> 63                                       **kwargs)
     64         self.k_constant = self.data.k_constant
     65         self.exog = self.data.exog

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/model.py in _handle_data(self, endog, exog, missing, hasconst, **kwargs)
     86 
     87     def _handle_data(self, endog, exog, missing, hasconst, **kwargs):
---> 88         data = handle_data(endog, exog, missing, hasconst, **kwargs)
     89         # kwargs arrays could have changed, easier to just attach here
     90         for key in kwargs:

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/data.py in handle_data(endog, exog, missing, hasconst, **kwargs)
    628     klass = handle_data_class_factory(endog, exog)
    629     return klass(endog, exog=exog, missing=missing, hasconst=hasconst,
--> 630                  **kwargs)

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/data.py in __init__(self, endog, exog, missing, hasconst, **kwargs)
     77 
     78         # this has side-effects, attaches k_constant and const_idx
---> 79         self._handle_constant(hasconst)
     80         self._check_integrity()
     81         self._cache = resettable_cache()

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/data.py in _handle_constant(self, hasconst)
    129             # detect where the constant is
    130             check_implicit = False
--> 131             const_idx = np.where(self.exog.ptp(axis=0) == 0)[0].squeeze()
    132             self.k_constant = const_idx.size
    133 

ValueError: zero-size array to reduction operation maximum which has no identity

I suspected that the error arose because the target variable (i.e. KPI6) contains some NaNs, so I tried dropping all rows where KPI6 is NaN like this, but the problem still persists:

KPI.dropna(subset = ['KPI6'])

I also tried dropping all rows that contain only NaN values, but the problem still persists:

KPI.dropna(how = 'all')

I combined both steps above and the problem still persists. The only way to eliminate this error is to actually impute the missing data with something (e.g. 0, mean, median, etc.). However, I'm hoping to avoid this method as much as possible, because I want to perform OLS regression on the original data.

OLS regression also works when I select only a few variables as predictor variables, but again this is not what I aim to do. I want to include all other variables besides KPI6 as predictor variables.
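One way to see why a smaller predictor set works is to count how many rows are complete over the columns actually used. The sketch below uses a small made-up frame as a stand-in for KPI (the column names and values are assumptions for illustration only, not the real data).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the KPI DataFrame: every row has at least one NaN
KPI = pd.DataFrame({
    "KPI1": [1.0, 2.0, np.nan, 4.0, np.nan],
    "KPI2": [np.nan, 1.5, 2.5, 3.5, np.nan],
    "KPI3": [0.5, np.nan, 1.5, np.nan, np.nan],
    "KPI6": [2.0, 3.0, 4.0, 5.0, np.nan],
})

# Fraction of missing values in each column, smallest first
missing_share = KPI.isna().mean().sort_values()

# Rows that survive when every column is used in the model
n_complete_all = len(KPI.dropna())

# Rows that survive when only KPI1 is used to predict KPI6
n_complete_subset = len(KPI.dropna(subset=["KPI6", "KPI1"]))

print(missing_share)
print(n_complete_all, n_complete_subset)
```

With all columns no row is complete, while the two-column subset still leaves rows to fit on.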

Is there any solution to this? I've been really stressed out over this for one week. Any help is appreciated. I'm not a pro Python coder, so I'd appreciate it if you could break down the problem (and suggest a solution) in layman's terms.

Thanks so much in advance.

The default missing-data handling when using formulas is to drop any row that contains at least one NaN. If every row contains a NaN, then there are no observations left. I think that is what the ValueError: zero-size array at the end of the traceback means: the design matrix has zero rows, so the range check in _handle_constant has nothing to reduce over.
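A minimal sketch of that mechanism, using a made-up three-row frame rather than the asker's data: row-wise deletion leaves an empty predictor array, and taking the per-column range of an empty array raises the same kind of error. The call to np.ptp mirrors the line shown at the bottom of the traceback. Newer statsmodels versions may implement that check differently.

```python
import numpy as np
import pandas as pd

# Made-up data: every row has at least one NaN
df = pd.DataFrame({
    "y":  [1.0, np.nan, 3.0],
    "x1": [np.nan, 2.0, 3.0],
    "x2": [1.0, 2.0, np.nan],
})

# Row-wise deletion, which is what the formula interface does by default
complete = df.dropna()
exog = complete[["x1", "x2"]].to_numpy()

# Per-column range (max - min) over zero rows has no defined result
try:
    np.ptp(exog, axis=0)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(exog.shape)
print(error_message)
```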

If you have enough data overall, then you can try imputing and estimating with MICE (multiple imputation by chained equations), which iteratively imputes the missing values for each variable.


