Using ols function with parameters that contain numbers/spaces

I am having a lot of difficulty using the statsmodels.formula.api function

       ols(formula,data).fit().rsquared_adj 

due to the nature of the names of my predictors. The predictors have numbers, spaces, etc. in them, which it clearly doesn't like. I understand that I need to use something like patsy.builtins.Q. So let's say my predictor is weight.in.kg; it should be entered as follows:

Q("weight.in.kg")

So I need to build my formula from a list, and the difficulty arises in wrapping every item in the list with patsy.builtins.Q:

formula = "{} ~ {} + 1".format(response, ' + '.join(candidate))

with candidate being my list of predictors.

My question to you, dearest Python experts, is how on earth do I put every individual item in the list candidate within the quotes in the following expression:

Q('')

so that the ols function can actually read it? Apologies if this is super obvious, me no good at python.

Right now you're starting with a list of terms that you want in your formula, then trying to paste them together into a complicated string, which patsy will parse and convert back into a list of terms. You can see the data structure that patsy generates for this kind of formula (ModelDesc.from_formula is patsy's parser):

In [7]: from patsy import ModelDesc

In [8]: ModelDesc.from_formula("y ~ x1 + x2 + x3")
Out[8]: 
ModelDesc(lhs_termlist=[Term([EvalFactor('y')])],
          rhs_termlist=[Term([]),
                        Term([EvalFactor('x1')]),
                        Term([EvalFactor('x2')]),
                        Term([EvalFactor('x3')])])

This might look a little intimidating, but it's pretty simple really -- you have a ModelDesc, which represents a single formula, and it has a left-hand-side list of terms and a right-hand-side list of terms. Each term is represented by a Term object, and each Term has a list of factors. (Here each term just has a single factor -- if you had any interactions then those terms would have multiple factors.) Also, the "empty interaction" Term([]) is how patsy represents the intercept term.
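To make the term/factor structure concrete, here is a small sketch that parses a formula with an interaction and counts the factors in each right-hand-side term. The variable names y, a and b are made up for illustration; no data is needed because only the parser is used.

```python
from patsy import ModelDesc

# Hypothetical formula: one main effect plus one two-way interaction.
desc = ModelDesc.from_formula("y ~ a + a:b")

# Number of factors per right-hand-side term:
# the intercept has none, a main effect has one, an interaction has two.
factor_counts = [len(term.factors) for term in desc.rhs_termlist]

for term, n in zip(desc.rhs_termlist, factor_counts):
    print(term.name(), n)
```

The intercept shows up as the term with zero factors, which is why it is written Term([]).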

So you can avoid all this complicated quoting/parsing stuff by directly creating the terms you want and passing them to patsy, skipping the string parsing step:

from patsy import ModelDesc, Term, LookupFactor

response_terms = [Term([LookupFactor(response)])]
# start with intercept...
model_terms = [Term([])]
# ...then add another term for each candidate
model_terms += [Term([LookupFactor(c)]) for c in candidates]
model_desc = ModelDesc(response_terms, model_terms)

and now you can pass that model_desc object into any function where you'd normally pass a patsy formula:

ols(model_desc, data).fit().rsquared_adj

There's another trick here: you'll notice that the first example has EvalFactor objects, and now we're using LookupFactor objects instead. The difference is that EvalFactor takes a string of arbitrary Python code, which is nice if you want to write something like np.log(x1), but really annoying if you have variables with names like weight.in.kg. LookupFactor directly takes the name of a variable to look up in your data, so no further quoting is needed.
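As a rough end-to-end check of the LookupFactor approach, the sketch below builds a design matrix straight from patsy, without statsmodels. The data values and the column names body.fat, weight.in.kg and height in cm are invented for the example; a plain dict of NumPy arrays stands in for a DataFrame.

```python
import numpy as np
from patsy import ModelDesc, Term, LookupFactor, dmatrices

# Hypothetical data whose column names contain dots and spaces.
data = {
    "body.fat": np.array([12.0, 18.5, 25.1, 30.2]),
    "weight.in.kg": np.array([60.0, 72.5, 85.0, 95.5]),
    "height in cm": np.array([170.0, 175.0, 180.0, 168.0]),
}
response = "body.fat"
candidates = ["weight.in.kg", "height in cm"]

# Left-hand side: the response; right-hand side: intercept plus one term per candidate.
model_desc = ModelDesc(
    [Term([LookupFactor(response)])],
    [Term([])] + [Term([LookupFactor(c)]) for c in candidates],
)

# dmatrices accepts a ModelDesc anywhere it accepts a formula string.
y, X = dmatrices(model_desc, data)
column_names = X.design_info.column_names
print(column_names)
```

The names are looked up verbatim in the data, so none of them ever passes through the formula parser.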

Alternatively, you could do this with some fancier Python string processing, like:

quoted = ["Q('{}')".format(c) for c in candidates]
formula = "{} ~ {} + 1".format(response, ' + '.join(quoted))

But while this is a bit simpler to start with, it's much more fragile -- for example, think about (or try) what happens if one of your predictor names contains a quote character! You should never write something like this in a processing pipeline where the candidate names come from somewhere else that you can't control (e.g., a random CSV file) -- you could get all kinds of arbitrary code execution. The solution above avoids all of these problems.
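The fragility can be seen without running patsy at all. This sketch uses only the standard library's ast module to inspect what the naive quoting produces; the helper names naive_quote, is_valid_python and called_names are made up here, and nothing is ever evaluated.

```python
import ast

def naive_quote(name):
    # Same string pasting as in the snippet above.
    return "Q('{}')".format(name)

def is_valid_python(expr):
    # True if the string parses as a single Python expression.
    try:
        ast.parse(expr, mode="eval")
        return True
    except SyntaxError:
        return False

def called_names(expr):
    # Names of all plain function calls in the expression (parsed, not run).
    tree = ast.parse(expr, mode="eval")
    return {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }

ok = naive_quote("weight.in.kg")
broken = naive_quote("patient's weight")
injected = naive_quote("x') + __import__('os').getcwd() + Q('y")

print(is_valid_python(ok))       # a well-behaved name parses fine
print(is_valid_python(broken))   # a stray quote breaks the expression
print(called_names(injected))    # a crafted name smuggles in extra calls
```

A single apostrophe is enough to produce an invalid expression, and a crafted name turns into code that an EvalFactor would go on to evaluate.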
