Is statsmodels incompatible with Dask dataframes in the machine learning models it offers?

I am trying to use statsmodels to fit my data to a logistic regression model (Logit), but my data is in a Dask dataframe rather than a pandas dataframe.

This is my sample dataset, smarket_1:

Response Variable: Direction

    const   Year    Lag1    Lag2    Lag3    Lag4    Lag5    Volume  Today   Direction
0   1.0 2001.0  0.381   -0.192  -2.624  -1.055  5.010   1.1913  0.959   1.0
1   1.0 2001.0  0.959   0.381   -0.192  -2.624  -1.055  1.2965  1.032   1.0
2   1.0 2001.0  1.032   0.959   0.381   -0.192  -2.624  1.4112  -0.623  0.0
3   1.0 2001.0  -0.623  1.032   0.959   0.381   -0.192  1.2760  0.614   1.0
4   1.0 2001.0  0.614   -0.623  1.032   0.959   0.381   1.2057  0.213   1.0
5   1.0 2001.0  0.213   0.614   -0.623  1.032   0.959   1.3491  1.392   1.0
6   1.0 2001.0  1.392   0.213   0.614   -0.623  1.032   1.4450  -0.403  0.0
7   1.0 2001.0  -0.403  1.392   0.213   0.614   -0.623  1.4078  0.027   1.0
8   1.0 2001.0  0.027   -0.403  1.392   0.213   0.614   1.1640  1.303   1.0
9   1.0 2001.0  1.303   0.027   -0.403  1.392   0.213   1.2326  0.287   1.0
10  1.0 2001.0  0.287   1.303   0.027   -0.403  1.392   1.3090  -0.498  0.0
11  1.0 2001.0  -0.498  0.287   1.303   0.027   -0.403  1.2580  -0.189  0.0
12  1.0 2001.0  -0.189  -0.498  0.287   1.303   0.027   1.0980  0.680   1.0
13  1.0 2001.0  0.680   -0.189  -0.498  0.287   1.303   1.0531  0.701   1.0
14  1.0 2001.0  0.701   0.680   -0.189  -0.498  0.287   1.1498  -0.562  0.0
15  1.0 2001.0  -0.562  0.701   0.680   -0.189  -0.498  1.2953  0.546   1.0
16  1.0 2001.0  0.546   -0.562  0.701   0.680   -0.189  1.1188  -1.747  0.0
17  1.0 2001.0  -1.747  0.546   -0.562  0.701   0.680   1.0484  0.359   1.0
18  1.0 2001.0  0.359   -1.747  0.546   -0.562  0.701   1.0130  -0.151  0.0
19  1.0 2001.0  -0.151  0.359   -1.747  0.546   -0.562  1.0596  -0.841  0.0

So, when I use the Logit class from statsmodels and fit my data:

from statsmodels.api import Logit

logistic_reg = Logit(endog=smarket_1['Direction'], exog=smarket_1.drop(labels='Direction', axis=1)).fit()
logistic_reg.summary()

I get the following error:

ValueError: unrecognized data structures: <class 'dask.dataframe.core.DataFrame'> / <class 'dask.dataframe.core.DataFrame'>

Next, I tried converting the Dask dataframe to a pandas one using .compute(), as follows:

from statsmodels.api import Logit

logistic_reg = Logit(endog=smarket_1['Direction'], exog=smarket_1.drop(labels='Direction', axis=1).compute()).fit()

This time I get a different error:

AttributeError: 'Index' object has no attribute 'equals'

However, when I passed the same Dask dataframe to sklearn's LogisticRegression model, it worked without any error.

So does statsmodels not support Dask dataframes?

No - you can't use scikit-learn or statsmodels with dask arrays or dataframes. These libraries are based on numpy data structures and have no support for out-of-core or delayed operations.

Instead, use the library dask-ml, which is part of the dask ecosystem, works directly with these data structures, and is designed to be similar to these numpy-based frameworks while using the dask scheduler.

Note that some algorithms you may be working with do not scale well (or at all) to parallel or partitioned datasets. Dask-ml has implemented a number of algorithms which are similar, but use approximation or sampling methods to achieve similar (but not identical) results. So be prepared to read up on the available methods and to be flexible in your need for exact solutions. Otherwise, your only option is to use a machine with more memory and compute the collection so you can use the numpy-based libraries.
