statsmodels linear regression model fails with "SyntaxError: invalid syntax"

I would like to use the statsmodels linear regression model, but I get the following error:

Traceback (most recent call last):
  File "C:\Users\aleks\PycharmProjects\statistics\econometrics.py", line 95, in <module>
    lr = sm.OLS.from_formula('rj13.2 ~ age+C(rh5)+C(r_diplom)+C(status)+C(rh6)+C(rj1.1.1)',df_stats_models).fit()
  File "C:\Users\aleks\PycharmProjects\statistics\venv\lib\site-packages\statsmodels\base\model.py", line 200, in from_formula
    tmp = handle_formula_data(data, None, formula, depth=eval_env,
  File "C:\Users\aleks\PycharmProjects\statistics\venv\lib\site-packages\statsmodels\formula\formulatools.py", line 63, in handle_formula_data
    result = dmatrices(formula, Y, depth, return_type='dataframe',
  File "C:\Users\aleks\PycharmProjects\statistics\venv\lib\site-packages\patsy\highlevel.py", line 309, in dmatrices
    (lhs, rhs) = _do_highlevel_design(formula_like, data, eval_env,
  File "C:\Users\aleks\PycharmProjects\statistics\venv\lib\site-packages\patsy\highlevel.py", line 164, in _do_highlevel_design
    design_infos = _try_incr_builders(formula_like, data_iter_maker, eval_env,
  File "C:\Users\aleks\PycharmProjects\statistics\venv\lib\site-packages\patsy\highlevel.py", line 66, in _try_incr_builders
    return design_matrix_builders([formula_like.lhs_termlist,
  File "C:\Users\aleks\PycharmProjects\statistics\venv\lib\site-packages\patsy\build.py", line 689, in design_matrix_builders
    factor_states = _factors_memorize(all_factors, data_iter_maker, eval_env)
  File "C:\Users\aleks\PycharmProjects\statistics\venv\lib\site-packages\patsy\build.py", line 354, in _factors_memorize
    which_pass = factor.memorize_passes_needed(state, eval_env)
  File "C:\Users\aleks\PycharmProjects\statistics\venv\lib\site-packages\patsy\eval.py", line 474, in memorize_passes_needed
    subset_names = [name for name in ast_names(self.code)
  File "C:\Users\aleks\PycharmProjects\statistics\venv\lib\site-packages\patsy\eval.py", line 474, in <listcomp>
    subset_names = [name for name in ast_names(self.code)
  File "C:\Users\aleks\PycharmProjects\statistics\venv\lib\site-packages\patsy\eval.py", line 105, in ast_names
    for node in ast.walk(ast.parse(code)):
  File "C:\Users\aleks\AppData\Local\Programs\Python\Python39\lib\ast.py", line 50, in parse
    return compile(source, filename, mode, flags,
  File "<unknown>", line 1
    C(rj1 .1 .1)
          ^
SyntaxError: invalid syntax

My code:

lr = sm.OLS.from_formula('rj13.2 ~ age+C(rh5)+C(r_diplom)+C(status)+C(rh6)+C(rj1.1.1)',df_stats_models).fit()
print(lr.summary())

The columns and df_stats_models.head() look like this:

Index(['rj13.2', 'rh6', 'rh5', 'r_diplom', 'status', 'rj1.1.1', 'age'], dtype='object')
      rj13.2     rh6      rh5  ...           status                  rj1.1.1   age
46   30000.0  1986.0  МУЖСКОЙ  ...  областной центр  ПОЛНОСТЬЮ УДОВЛЕТВОРЕНЫ  27.0
178  22000.0  1992.0  МУЖСКОЙ  ...            город     СКОРЕЕ УДОВЛЕТВОРЕНЫ  21.0
271  10200.0  1964.0  ЖЕНСКИЙ  ...            город     СКОРЕЕ УДОВЛЕТВОРЕНЫ  49.0
537   6000.0  1952.0  ЖЕНСКИЙ  ...            город     СКОРЕЕ УДОВЛЕТВОРЕНЫ  61.0
538  13000.0  1964.0  ЖЕНСКИЙ  ...            город     СКОРЕЕ УДОВЛЕТВОРЕНЫ  49.0

Why does it get angry at C(rj1.1.1)?

To read R-style formulas, statsmodels uses the patsy package, which treats each term as Python code, so its parser does not accept special characters (like . or -) in variable names. To "protect" such names, you can wrap them in the Q() function (use double quotes around the formula so the names inside can be single-quoted):

lr = sm.OLS.from_formula("Q('rj13.2') ~ age+C(rh5)+C(r_diplom)+C(status)+C(rh6)+C(Q('rj1.1.1'))", df_stats_models).fit()
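The traceback shows patsy handing each term to ast.parse, so you can check which names are a problem using only the standard library. The sketch below is illustrative, not part of patsy or statsmodels: parses_as_python and quote_if_needed are hypothetical helper names, and the column lists are copied from the question. It wraps only those names that are not valid Python identifiers in Q() and assembles the formula string.

```python
import ast
import keyword


def parses_as_python(expr):
    """Return True if expr is syntactically valid Python (the check patsy trips over)."""
    try:
        ast.parse(expr)
        return True
    except SyntaxError:
        return False


def quote_if_needed(name):
    """Hypothetical helper: wrap a column name in Q('...') unless it is a plain identifier."""
    if name.isidentifier() and not keyword.iskeyword(name):
        return name
    return f"Q({name!r})"


# Column names taken from the question.
response = "rj13.2"
numeric = ["age"]
categorical = ["rh5", "r_diplom", "status", "rh6", "rj1.1.1"]

terms = [quote_if_needed(c) for c in numeric]
terms += [f"C({quote_if_needed(c)})" for c in categorical]
formula = f"{quote_if_needed(response)} ~ {' + '.join(terms)}"

print(parses_as_python("C(rj1.1.1)"))       # the term from the traceback
print(parses_as_python("C(Q('rj1.1.1'))"))  # the protected version
print(formula)
```

The resulting formula string can then be passed to sm.OLS.from_formula(formula, df_stats_models) in place of the hand-written one.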
