
When curve fitting with Python's statsmodels' OLS linear regression, how do I choose the constant in the formula?

  • I'd like to fit linear regression models of different polynomial degrees to a data set and choose the best-fitting one based on adjusted R^2.
  • Based on other answers, I'm using the OLS formula "y ~ 1 + " + " + ".join("I(x**{})".format(i) for i in range(1, degree+1)).
  • I don't have enough statistics knowledge to understand: is the 1 + constant needed and, if so, what should the constant value be?
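import numpy
import pandas
import matplotlib
import matplotlib.offsetbox
import matplotlib.pyplot  # used below for plot(), gcf(), tight_layout(), show()
import matplotlib.ticker  # used below for FormatStrFormatter
import statsmodels.formula.api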

data = numpy.array([
  [1999, 197.0],
  [2000, 196.5],
  [2001, 194.3],
  [2002, 193.7],
  [2003, 192.0],
  [2004, 189.2],
  [2005, 189.3],
  [2006, 187.6],
  [2007, 186.9],
  [2008, 186.0],
  [2009, 185.0],
  [2010, 186.2],
  [2011, 185.1],
  [2012, 185.6],
  [2013, 185.0],
  [2014, 185.6],
  [2015, 185.4],
  [2016, 185.1],
  [2017, 183.9],
])

df = pandas.DataFrame(data, columns=["Year", "CrudeRate"])

cause = "Malignant neoplasms"
x = df["Year"].values
y = df["CrudeRate"].values
degree = 2
predict_future_years = 5

# https://stackoverflow.com/a/34617603/4135310
olsdata = {"x": x, "y": y}
formula = "y ~ 1 + " + " + ".join("I(x**{})".format(i) for i in range(1, degree+1))
model = statsmodels.formula.api.ols(formula, olsdata).fit()

print(model.summary())

ax = df.plot("Year", "CrudeRate", kind="scatter", grid=True, title="Deaths from {}".format(cause))

# https://stackoverflow.com/a/37294651/4135310
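# model.params is ordered [Intercept, I(x ** 1), I(x ** 2), ...];
# numpy.poly1d expects the highest-degree coefficient first, hence the reversal.
func = numpy.poly1d(model.params.values[::-1])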
matplotlib.pyplot.plot(df["Year"], func(df["Year"]))

predicted = func(df.Year.values[-1] + predict_future_years)
print("Predicted in {} years: {}".format(predict_future_years, predicted))

ax.add_artist(matplotlib.offsetbox.AnchoredText("$\\barR^2$ = {:0.2f}".format(model.rsquared_adj), loc="upper center"))
ax.add_artist(matplotlib.offsetbox.AnchoredText("Predicted in +{} = {:0.2f}".format(predict_future_years, predicted), loc="upper right"))

ax.xaxis.set_major_formatter(matplotlib.ticker.FormatStrFormatter("%d"))
fig = matplotlib.pyplot.gcf()
fig.autofmt_xdate(bottom=0.2, rotation=30, ha="right", which="both")
matplotlib.pyplot.tight_layout()
cleaned_title = cause.replace(" ", "_").replace("(", "").replace(")", "")
#matplotlib.pyplot.savefig("{}_{}.png".format(cleaned_title, degree), dpi=100)
matplotlib.pyplot.show()
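
The comparison described in the bullets at the top (fit several degrees, keep the one with the highest adjusted R^2) could be sketched roughly as follows. This is only an illustration under assumptions: the data are synthetic (not the CrudeRate series above), and build_formula and best_degree are hypothetical names introduced here. The formula omits "1 + " because the formula interface includes an intercept by default.

```python
import numpy
import statsmodels.formula.api as smf


def build_formula(degree):
    # Hypothetical helper: builds "y ~ I(x**1) + I(x**2) + ..." up to `degree`.
    # No explicit "1 +": the formula interface adds the intercept by default.
    terms = " + ".join("I(x**{})".format(i) for i in range(1, degree + 1))
    return "y ~ " + terms


# Synthetic, roughly quadratic data (assumed for illustration only).
x = numpy.arange(0, 12, dtype=float)
noise = numpy.array([0.2, -0.1, 0.3, -0.3, 0.1, 0.0, -0.2, 0.25, -0.15, 0.1, -0.05, 0.15])
y = 5.0 - 1.5 * x + 0.25 * x**2 + noise

# Fit each candidate degree and record its adjusted R^2.
results = {}
for degree in range(1, 4):
    fit = smf.ols(build_formula(degree), {"x": x, "y": y}).fit()
    results[degree] = fit.rsquared_adj

best_degree = max(results, key=results.get)
for degree, adj_r2 in results.items():
    print("degree {}: adjusted R^2 = {:.4f}".format(degree, adj_r2))
print("best degree by adjusted R^2:", best_degree)
```

Adjusted R^2 penalizes extra terms only mildly, so a higher degree can still win by a small margin; it may be worth checking that the gain is meaningful before extrapolating with the higher-degree fit.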

[Figure: plot produced by the code above]

Based on comments from @ALollz: when using Patsy formula notation (e.g. statsmodels.formula.api.ols("y ~ x")), you don't need to include 1 +, because an intercept is added to the model by default. The 1 does not mean that the model has a constant equal to 1. It means that the model has a constant term whose magnitude is given by the fitted Intercept coefficient. That constant is estimated by OLS, so it's the one you want.
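
A quick way to check this for yourself is to fit the same data with and without the explicit 1 and compare the coefficients. The following is a minimal sketch on made-up data (the numbers are assumptions for illustration, not the data from the question); it also shows that "- 1" in the formula is how the intercept is removed.

```python
import numpy
import statsmodels.formula.api as smf

# Synthetic data, assumed for illustration: roughly y = 3 + 2x plus small noise.
x = numpy.arange(10, dtype=float)
noise = numpy.array([0.1, -0.2, 0.05, 0.0, -0.1, 0.2, -0.05, 0.1, 0.0, -0.1])
y = 3.0 + 2.0 * x + noise
data = {"x": x, "y": y}

implicit = smf.ols("y ~ x", data).fit()      # intercept added by default
explicit = smf.ols("y ~ 1 + x", data).fit()  # same model, intercept written out
no_const = smf.ols("y ~ x - 1", data).fit()  # "- 1" removes the intercept

print(implicit.params)  # Intercept and x
print(explicit.params)  # same values as above
print(no_const.params)  # x only
```

The first two fits report the same Intercept and x coefficients, and the Intercept is the estimated constant (close to 3 here), not the literal value 1.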
