Why doesn't Statsmodels OLS support column names with multiple words?

I've been experimenting with Seaborn's lmplot() and Statsmodels .ols() functions for simple linear regression plots and their associated p-values, r-squared, etc.

I've noticed that when I specify which columns to use for lmplot, I can name a column even if its name contains multiple words:

import seaborn as sns
import pandas as pd
input_csv = pd.read_csv('./test.csv',index_col = 0,header = 0)
input_csv

[Image: contents of the CSV]

sns.lmplot(x='Age',y='Count of Specific Strands',data = input_csv)
<seaborn.axisgrid.FacetGrid at 0x2800985b710>

[Image: resulting lmplot]

However, if I try to use ols, I get an error when I pass "Count of Specific Strands" as the dependent variable (only the last few lines of the error are shown):

import statsmodels.formula.api as smf
test_results = smf.ols('Count of Specific Strands ~ Age',data = input_csv).fit()

File "<unknown>", line 1
    Count of Specific Strands
           ^
SyntaxError: invalid syntax

Conversely, if I select the "Count of Specific Strands" column by position, as shown below, the regression works:

test_results = smf.ols('input_csv.iloc[:,1] ~ Age',data = input_csv).fit()
test_results.summary()

[Image: regression results]

Does anyone know why this is? Is it just because of how Statsmodels was written? Is there an alternative to specify the dependent variable for regression analysis that doesn't involve iloc or loc?

This is due to the way patsy, the formula parser that Statsmodels uses, is written; see this link for more information.

The authors of patsy have, however, thought of this problem (quoted from here):

This flexibility does create problems in one case, though – because we interpret whatever you write in-between the + signs as Python code, you do in fact have to write valid Python code. And this can be tricky if your variable names have funny characters in them, like whitespace or punctuation. Fortunately, patsy has a builtin “transformation” called Q() that lets you “quote” such variables
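To make the quoted point concrete, here is a minimal sketch that uses Python's built-in compile() to check whether each side of a formula would be valid Python on its own. This is not patsy's actual parsing code; the helper name is_valid_python_expression is made up for illustration.

```python
def is_valid_python_expression(source: str) -> bool:
    """Return True if `source` compiles as a single Python expression.

    Hypothetical helper for illustration only; patsy's real parser
    is more involved than a plain compile() call.
    """
    try:
        compile(source, "<formula term>", "eval")
    except SyntaxError:
        return False
    return True


# Bare words separated by spaces are not a Python expression.
print(is_valid_python_expression("Count of Specific Strands"))

# A single identifier is fine, which is why 'Age' works unquoted.
print(is_valid_python_expression("Age"))

# Wrapped in Q("..."), the name becomes a string argument to a function call,
# which is valid Python syntax.
print(is_valid_python_expression('Q("Count of Specific Strands")'))
```

The first check fails for the same reason the traceback in the question ends in a SyntaxError, while the other two pass.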

Therefore, in your case, you should be able to write:

smf.ols('Q("Count of Specific Strands") ~ Age',data = input_csv).fit()
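As a self-contained sketch, the snippet below fits the same model two ways: once with Q(), and once after renaming the columns so that they are valid Python identifiers. The renaming is an alternative I am adding here, not something from the patsy docs. The small DataFrame is invented data standing in for test.csv, which is not available.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data standing in for the question's test.csv (assumption).
df = pd.DataFrame(
    {
        "Age": [20, 30, 40, 50, 60],
        "Count of Specific Strands": [3, 5, 6, 9, 10],
    }
)

# Option 1: quote the multi-word column name with patsy's Q().
quoted = smf.ols('Q("Count of Specific Strands") ~ Age', data=df).fit()

# Option 2: rename columns so every name is a valid Python identifier.
renamed_df = df.rename(columns=lambda name: name.replace(" ", "_"))
renamed = smf.ols("Count_of_Specific_Strands ~ Age", data=renamed_df).fit()

# Both fits should report the same slope for Age.
print(quoted.params["Age"], renamed.params["Age"])
```

Q() leaves the DataFrame untouched but makes the formula string and the summary labels longer. Renaming keeps formulas short at the cost of changing the column names.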
