Creating a regression model using Day of Week, Hour of Day, and Type of Media?
Working with Python 3 in a Jupyter notebook. I am trying to create a regression model (equation?) to predict the Eng as % of Followers variable. I'd be given Media Type, Hour Created, and Day of Week. These should all be treated as categorical variables.
Here is some of the past data I have.
  Media Type  Eng as % of Followers  Hour Created  Day of Week
0      Video                 0.0136            23      Tuesday
1      Video                 0.0163            22    Wednesday
2      Video                 0.0163            22      Tuesday
3      Video                 0.0196            22       Friday
4      Video                 0.0179            20     Thursday
5      Photo                 0.0087            14    Wednesday
I've created dummy variables using pd.get_dummies, but I'm not sure I did that correctly - the problem specifically lies with the Hour Created variable. They're numbers, but I want them treated as categories. For example, Hour 22 might be a performance booster, but that shouldn't imply anything about Hours 21 or 23.
I'm also curious if I could have my model factor in the interaction between Day of Week and Hour Created (maybe Hour 22 is a boost on most days, but 22-Friday causes a dip) like I've seen done with patsy... but that might be me getting greedy.
Here is how I created my dummy variables, which sets me up for the issue of having Hour Created as a quantitative variable, instead of qualitative. Also, the Vars dataframe that I'd use going forward now doesn't have the very thing that I'm trying to predict. Could that possibly be right?
Vars = Training[['Hour Created','Day of Week','Media Type']]
Result = Training['Eng as % of Followers']
Vars = pd.get_dummies(data=Vars, drop_first=True)
If someone could help with the Hour Created problem, that would be a great start... And then, I'm not sure where to go from there. I've seen people use the ols function in this situation, or linear_model from sklearn. I'm struggling with how to interpret the results from either, and especially with how I'd plug a dataframe of those 3 independent variables into that model. If someone can make a suggestion, I'll try to run with it.
Edit: Including a couple of ways I tried to create this model. Here's the first, which I assume is using my Hour data incorrectly. And given that the dataframe I'm passing into it doesn't even have Eng as % of Followers as a column header, I'm not even sure what it's trying to predict...
Vars_train, Vars_test, Result_train, Result_test = train_test_split(Vars, Result, test_size = .20, random_state = 40)
regr = linear_model.LinearRegression()
regr.fit(Vars_train, Result_train)
predicted = regr.predict(Vars_test)
When I try to use the ols method as follows, I get an invalid syntax error. I've tried different variations to no avail.
fit1 = ols('Eng as % of Followers ~ C(Day of Week) + C(Hour Created) + C(Media Type)', data=Training).fit()
One way to make sure that you are doing dummy coding correctly is to convert the columns to str types. In your case you want to consider Hour Created as categorical even though it is numeric in nature, so it is better to convert its values to strings before doing dummy coding.
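A minimal sketch of that point, using the six rows from the question. The frame name training and the other variable names are assumptions for illustration. It shows that pd.get_dummies passes an integer column through unchanged unless the column is cast to str or named explicitly in the columns argument.

```python
import pandas as pd

# The six sample rows from the question.
training = pd.DataFrame({
    "Media Type": ["Video", "Video", "Video", "Video", "Video", "Photo"],
    "Eng as % of Followers": [0.0136, 0.0163, 0.0163, 0.0196, 0.0179, 0.0087],
    "Hour Created": [23, 22, 22, 22, 20, 14],
    "Day of Week": ["Tuesday", "Wednesday", "Tuesday", "Friday", "Thursday", "Wednesday"],
})

features = training[["Hour Created", "Day of Week", "Media Type"]]

# Left as integers, Hour Created passes through as one numeric column.
numeric_hours = pd.get_dummies(features)

# Cast to str first, and each hour gets its own indicator column.
as_text = features.assign(**{"Hour Created": features["Hour Created"].astype(str)})
categorical_hours = pd.get_dummies(as_text)

# Alternative without casting: name the columns to encode explicitly.
explicit = pd.get_dummies(
    features, columns=["Hour Created", "Day of Week", "Media Type"]
)

print(list(numeric_hours.columns))
print(list(categorical_hours.columns))
```

Either route gives one indicator per hour, so a coefficient for hour 22 says nothing about hours 21 or 23.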
To capture the interaction between Day of Week and Hour Created, do some feature engineering: create your own feature by combining Day of Week and Hour Created (for categorical values this means joining the two labels into one, not numeric multiplication) and feed it as an input to your model.
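As an alternative to building the interaction column by hand, the formula interface mentioned in the question can express it directly. The sketch below assumes statsmodels is installed and renames the columns to plain identifiers (the new names are made up for illustration). Column names containing spaces and % are not valid Python names inside a formula, which is most likely what triggers the invalid syntax error. With only six rows there are more coefficients than observations, so the fitted numbers mean little here.

```python
import pandas as pd
import statsmodels.formula.api as smf

# The six sample rows from the question.
training = pd.DataFrame({
    "Media Type": ["Video", "Video", "Video", "Video", "Video", "Photo"],
    "Eng as % of Followers": [0.0136, 0.0163, 0.0163, 0.0196, 0.0179, 0.0087],
    "Hour Created": [23, 22, 22, 22, 20, 14],
    "Day of Week": ["Tuesday", "Wednesday", "Tuesday", "Friday", "Thursday", "Wednesday"],
})

# Rename to identifiers that a formula can reference directly
# (hypothetical names chosen for this sketch).
renamed = training.rename(columns={
    "Media Type": "Media_Type",
    "Eng as % of Followers": "Eng_pct",
    "Hour Created": "Hour_Created",
    "Day of Week": "Day_of_Week",
})

# C(...) marks a column as categorical, so hours are not treated as numbers.
main_effects = smf.ols(
    "Eng_pct ~ C(Day_of_Week) + C(Hour_Created) + C(Media_Type)",
    data=renamed,
).fit()

# a:b adds the hour-by-day interaction on top of the main effects.
with_interaction = smf.ols(
    "Eng_pct ~ C(Day_of_Week) + C(Hour_Created) + C(Media_Type)"
    " + C(Hour_Created):C(Day_of_Week)",
    data=renamed,
).fit()

print(main_effects.params)
```

Each coefficient is reported relative to a reference level that the formula machinery drops automatically, which plays the same role as drop_first=True in pd.get_dummies.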
To understand and interpret your model, you can look at the weights/coefficients of the different features, which give an idea of how each feature impacts your target variable, positively or negatively.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
df
   Media   Type  Eng_as_%_of_Followers  Hour_Created  Day_of_Week
0      0  Video                 0.0136            23      Tuesday
1      1  Video                 0.0163            22    Wednesday
2      2  Video                 0.0163            22      Tuesday
3      3  Video                 0.0196            22       Friday
4      4  Video                 0.0179            20     Thursday
5      5  Photo                 0.0087            14    Wednesday
df["Hour_Created"] = df["Hour_Created"].astype(str)
df["Interaction"] = df["Hour_Created"] + "_" +df["Day_of_Week"]
X = df.drop("Eng_as_%_of_Followers", axis=1)
Y = df["Eng_as_%_of_Followers"]
X_encoded = pd.get_dummies(X)
X_train, X_test, y_train, y_test = train_test_split(
X_encoded, Y, test_size=0.33, random_state=42)
reg = LinearRegression().fit(X_train, y_train)
coef_dict = dict(zip(X_encoded.columns, reg.coef_))
coef_dict
{'Day_of_Week_Friday': 0.0012837455830388678,
'Day_of_Week_Thursday': 0.0007424028268551229,
'Day_of_Week_Tuesday': -0.0008084805653710235,
'Day_of_Week_Wednesday': -0.0012176678445229678,
'Hour_Created_14': -0.0012176678445229678,
'Hour_Created_20': 0.0007424028268551229,
'Hour_Created_22': 0.0004752650176678456,
'Hour_Created_23': 0.0,
'Interaction_14_Wednesday': -0.0012176678445229678,
'Interaction_20_Thursday': 0.0007424028268551229,
'Interaction_22_Friday': 0.0012837455830388678,
'Interaction_22_Tuesday': -0.0008084805653710235,
'Interaction_22_Wednesday': 0.0,
'Interaction_23_Tuesday': 0.0,
'Media': -0.0008844522968197866,
'Type_Photo': -0.0012176678445229708,
'Type_Video': 0.0012176678445229685}
Of course the results may not be really interesting here, because I was just working with 6 data points.
Answering your questions
You can find the y-intercept using reg.intercept_.
Yes, you can plug in new values for x and get your target variable by using reg.predict(x), where x is your new input.
Regression done by ols and by sklearn's LinearRegression is one and the same: both fit an ordinary least squares model. OLS is nothing but a way to solve the optimization problem which we have in regression.
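To plug new inputs into the fitted sklearn model, the new rows need the same dummy columns, in the same order, as the training matrix. Below is a minimal sketch under stated assumptions: it uses the six rows from the question, a helper named encode that exists only for this example, and a hypothetical new post (Video, hour 22, Friday). reindex fills any dummy column the new row lacks with 0.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The six sample rows from the question.
training = pd.DataFrame({
    "Media Type": ["Video", "Video", "Video", "Video", "Video", "Photo"],
    "Eng as % of Followers": [0.0136, 0.0163, 0.0163, 0.0196, 0.0179, 0.0087],
    "Hour Created": [23, 22, 22, 22, 20, 14],
    "Day of Week": ["Tuesday", "Wednesday", "Tuesday", "Friday", "Thursday", "Wednesday"],
})

target = "Eng as % of Followers"
features = ["Media Type", "Hour Created", "Day of Week"]


def encode(frame):
    """Dummy-code the three inputs (illustrative helper, not from the answer)."""
    # Cast the hour to str so it is encoded as a category, not a number.
    frame = frame.assign(**{"Hour Created": frame["Hour Created"].astype(str)})
    return pd.get_dummies(frame[features], dtype=float)


X = encode(training)
y = training[target]
reg = LinearRegression().fit(X, y)

# Hypothetical new post to score (not part of the original data).
new_posts = pd.DataFrame({
    "Media Type": ["Video"],
    "Hour Created": [22],
    "Day of Week": ["Friday"],
})

# Align to the training columns; categories absent from the new rows become 0.
X_new = encode(new_posts).reindex(columns=X.columns, fill_value=0)

print(reg.intercept_)      # constant term of the fitted equation
print(reg.predict(X_new))  # predicted Eng as % of Followers
```

The target column is deliberately left out of X: fit receives it separately as y, which is why the feature dataframe never contains the thing being predicted.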
Hope this helps!