
Why doesn't Statsmodels OLS support reading in columns with multiple words?

I've been experimenting with Seaborn's lmplot() and Statsmodels' ols() functions to produce simple linear regression plots and their associated p-values, R-squared, etc.

I've noticed that when I specify which columns to use for lmplot, I can name a column even if its name contains multiple words:

import seaborn as sns
import pandas as pd
input_csv = pd.read_csv('./test.csv',index_col = 0,header = 0)
input_csv

[Image: contents of the CSV file]

sns.lmplot(x='Age',y='Count of Specific Strands',data = input_csv)
<seaborn.axisgrid.FacetGrid at 0x2800985b710>

[Image: the resulting regression plot]

However, if I try to use ols, I get an error when I pass "Count of Specific Strands" as my dependent variable (only the last few lines of the traceback are shown):

import statsmodels.formula.api as smf
test_results = smf.ols('Count of Specific Strands ~ Age',data = input_csv).fit()

File "<unknown>", line 1
    Count of Specific Strands
           ^
SyntaxError: invalid syntax

Conversely, if I refer to the "Count of Specific Strands" column by position, as shown below, the regression works:

test_results = smf.ols('input_csv.iloc[:,1] ~ Age',data = input_csv).fit()
test_results.summary()

[Image: regression results summary]

Does anyone know why this is? Is it just because of how Statsmodels was written? Is there another way to specify the dependent variable for regression analysis that doesn't involve iloc or loc?

This is due to the way patsy, the formula parser that Statsmodels uses, is written (see the patsy documentation for more information).

The authors of patsy have, however, thought of this problem (quoted from the patsy documentation):

This flexibility does create problems in one case, though – because we interpret whatever you write in-between the + signs as Python code, you do in fact have to write valid Python code. And this can be tricky if your variable names have funny characters in them, like whitespace or punctuation. Fortunately, patsy has a builtin “transformation” called Q() that lets you “quote” such variables.
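As a rough illustration of the quoted passage, here is a minimal sketch that calls patsy directly. The DataFrame and its values are made up for this example; only the column names mirror the question. The exact exception type raised for the bare name may differ between patsy versions, so the sketch catches a generic Exception.

```python
import pandas as pd
import patsy

# Made-up data (an assumption for illustration); only the column names
# mirror the question.
df = pd.DataFrame({
    "Age": [1, 2, 3, 4, 5],
    "Count of Specific Strands": [2.0, 4.1, 5.9, 8.2, 9.8],
})

# The bare column name is not valid Python, so patsy cannot evaluate it.
try:
    patsy.dmatrices("Count of Specific Strands ~ Age", data=df)
    bare_name_ok = True
except Exception as exc:  # exact exception type may vary by patsy version
    bare_name_ok = False
    print("bare name failed with:", type(exc).__name__)

# Q("...") looks the column up by its literal name instead.
y, X = patsy.dmatrices('Q("Count of Specific Strands") ~ Age', data=df)
print(y.design_info.column_names)
print(X.design_info.column_names)
```

The workaround in the question, input_csv.iloc[:,1], works for the same reason: it is valid Python, so patsy can evaluate it.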

Therefore, in your case, you should be able to write:

smf.ols('Q("Count of Specific Strands") ~ Age',data = input_csv).fit()
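For anyone who wants to try this without the original test.csv, here is a self-contained sketch with made-up data (the values are an assumption, not the asker's file). It also shows one possible alternative that was not part of the answer above: renaming the column to a valid Python identifier, here the hypothetical name strand_count, so that no quoting is needed.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up stand-in for the asker's CSV (values are an assumption).
df = pd.DataFrame({
    "Age": [1, 2, 3, 4, 5, 6],
    "Count of Specific Strands": [2.1, 3.9, 6.2, 7.8, 10.1, 12.0],
})

# Option 1: quote the multi-word column name with patsy's Q().
quoted = smf.ols('Q("Count of Specific Strands") ~ Age', data=df).fit()

# Option 2 (an alternative, not from the original answer): rename the
# column to a valid identifier; "strand_count" is a hypothetical name.
renamed = df.rename(columns={"Count of Specific Strands": "strand_count"})
plain = smf.ols("strand_count ~ Age", data=renamed).fit()

# Both fits use the same data, so the estimates should agree.
print(quoted.params)
print(plain.params)
```

Either way, the summary is available through .summary() as in the question; with Q(), the dependent variable is labelled with the quoted expression rather than the bare column name.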
