简体   繁体   中英

Input format for logistic regression in scikit-learn as in R

When Using logistic regression in R , the data input for the 'glm' function (family = binomial) can be: (?family) in several formats, and specifically in the format of:

......

For the binomial and quasibinomial families the response can be specified in one of three ways:

......

As a numerical vector with values between 0 and 1, interpreted as the proportion of successful cases (with the total number of cases given by the weights)....

I have aggregated data that represents proportion of success out of trials (number between 0 and 1) and their equivalent weights, I'm interested in applying logistic regression with it, which would be trivial to use in R.

Unfortunately i cant use R in this project, and would like to use scikit-learn to estimate the logistic regression coefficients . More precise, i'm looking to apply the sklearn.linear_model.LogisticRegression in a form of input that will allow me to insert the model proportions and wights, in a similar fashion as available in R.

example:

from sklearn import linear_model
import pandas as pd

df = pd.DataFrame([[1,1,1,0], [1,1,1,0],[1,1,1,1],[2,2,1,1] , [2,2,1,1],[2,2,1,0] , [3,3,1,0] ],columns=['a', 'b','Trials','Success'])

logistic = linear_model.LogisticRegression()
#this works
logistic.fit(X=df[['a','b','Trials']] , y=df.Success)
logistic.predict_proba(df[['a','b','Trials']])
prob_to_success = logistic.predict_proba(df[['a','b','Trials']])[:,1]


    prob_to_success

Out[51]:  array([ 0.45535843,  0.45535843,  0.45535843,  0.42212169,  0.42212169,
        0.42212169,  0.38957565])

#How can i use the following Data?
df_agg = df.groupby(['a','b'] , as_index=False)['Trials','Success'].sum()
df_agg["Prop"] = df_agg.Success / (df_agg.Trials)
df_agg

 #I want to use Prop & Trials as weights in df_agg

Thanks in advance!

Convert to log-odds form and use linear regression on the transformation. Sklearn doesn't seem to have a quasi-binomial conversion for logistic regression. As you said, trivial in R but sklearn seems to not have anything of the sort.

如果要使用权重,可以在LogisticRegression的拟合函数中使用它们:

fit(X, y, sample_weight=None)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM