Creating a regression model using Day of Week, Hour of Day, and Type of Media?

Working with Python 3 in a Jupyter notebook. I am trying to create a regression model (equation?) to predict the Eng as % of Followers variable. I'd be given Media Type, Hour Created, and Day of Week. These should all be treated as categorical variables.

Here is some of the past data I have.

   Media Type  Eng as % of Followers  Hour Created  Day of Week
0  Video       0.0136                 23            Tuesday
1  Video       0.0163                 22            Wednesday
2  Video       0.0163                 22            Tuesday
3  Video       0.0196                 22            Friday
4  Video       0.0179                 20            Thursday
5  Photo       0.0087                 14            Wednesday

I've created dummy variables using pd.get_dummies, but I'm not sure I did that correctly. The problem specifically lies with the Hour Created variable. The values are numbers, but I want them treated as categories. For example, Hour 22 might be a performance booster, but that shouldn't imply anything about Hours 21 or 23.

I'm also curious whether I could have my model factor in the interaction between Day of Week and Hour Created (maybe Hour 22 is a boost on most days, but 22-Friday causes a dip), like I've seen done with patsy... but that might be me getting greedy.

Here is how I created my dummy variables, which sets me up for the issue of having Hour Created as a quantitative variable instead of a qualitative one. Also, the Vars dataframe that I'd use going forward now doesn't have the very thing that I'm trying to predict. Could that possibly be right?

Vars = Training[['Hour Created','Day of Week','Media Type']]
Result = Training['Eng as % of Followers']
Vars = pd.get_dummies(data=Vars, drop_first=True)

If someone could help with the Hour Created problem, that would be a great start. After that, I'm not sure where to go. I've seen people use the ols function in this situation, or linear_model from sklearn. I'm struggling with how to interpret the results from either, and especially with how I'd plug a dataframe of those 3 independent variables into that model. If someone can make a suggestion, I'll try to run with it.

Edit: Including a couple of ways I tried to create this model. Here's the first, which I assume is using my Hour data incorrectly. And since the dataframe I'm passing into it doesn't even have Eng as % of Followers as a column header, I'm not even sure what it's trying to predict...

Vars_train, Vars_test, Result_train, Result_test = train_test_split(Vars, Result, test_size = .20, random_state = 40)
regr = linear_model.LinearRegression() 
regr.fit(Vars_train, Result_train)
predicted = regr.predict(Vars_test)

When I try to use the ols method as follows, I get an invalid syntax error. I've tried different variations to no avail.

fit1 = ols('Eng as % of Followers ~ C(Day of Week) + C(Hour Created) + C(Media Type)', data=Training).fit() 
  1. One way to make sure that you are doing dummy coding correctly is to convert the columns to str types. In your case you want to treat Hour Created as categorical even though it is numeric in nature, so it is better to convert it to strings before doing dummy coding.

  2. To capture the interaction between Day of Week and Hour Created, do some feature engineering: create your own feature by combining Day of Week and Hour Created (for categorical values, by concatenating them as strings) and feed it as an input to your model.

  3. To understand and interpret your model, you can look at the weights (coefficients) of the different features, which give an idea of whether each feature pushes your target variable up or down.
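Point 1 can be checked directly. The sketch below is only an illustration: it rebuilds the six sample rows from the question, with underscored column names chosen here for convenience, and compares what pd.get_dummies produces when Hour_Created is left as an integer, cast to str, or listed explicitly through the columns argument.

```python
import pandas as pd

# The six sample rows from the question; underscored names are an assumption of this sketch.
X = pd.DataFrame({
    "Type": ["Video", "Video", "Video", "Video", "Video", "Photo"],
    "Hour_Created": [23, 22, 22, 22, 20, 14],
    "Day_of_Week": ["Tuesday", "Wednesday", "Tuesday", "Friday", "Thursday", "Wednesday"],
})

# Left as integers, Hour_Created passes through as a single numeric column.
numeric_hour = pd.get_dummies(X)

# Option 1: cast to str first, so each hour becomes its own dummy column.
as_str = pd.get_dummies(X.assign(Hour_Created=X["Hour_Created"].astype(str)))

# Option 2: keep the integers but name the columns to encode explicitly.
explicit = pd.get_dummies(X, columns=["Type", "Hour_Created", "Day_of_Week"])

print(list(numeric_hour.columns))
print(list(as_str.columns))
print(list(explicit.columns))
```

Either option gives one column per observed hour, so a coefficient for hour 22 says nothing about hours 21 or 23.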

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df 

   Media  Type   Eng_as_%_of_Followers  Hour_Created  Day_of_Week
0  0      Video  0.0136                 23            Tuesday
1  1      Video  0.0163                 22            Wednesday
2  2      Video  0.0163                 22            Tuesday
3  3      Video  0.0196                 22            Friday
4  4      Video  0.0179                 20            Thursday
5  5      Photo  0.0087                 14            Wednesday

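# Treat the hour as a category: cast it to str so get_dummies encodes it
df["Hour_Created"] = df["Hour_Created"].astype(str)
# Interaction feature: one category per (hour, day) combination
df["Interaction"] = df["Hour_Created"] + "_" + df["Day_of_Week"]

# Separate the predictors from the target
X = df.drop("Eng_as_%_of_Followers", axis=1)
Y = df["Eng_as_%_of_Followers"]

# One dummy column per category of every string column
X_encoded = pd.get_dummies(X)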

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, Y, test_size=0.33, random_state=42)

reg = LinearRegression().fit(X_train, y_train)

coef_dict = dict(zip(X_encoded.columns, reg.coef_))

coef_dict

{'Day_of_Week_Friday': 0.0012837455830388678,
 'Day_of_Week_Thursday': 0.0007424028268551229,
 'Day_of_Week_Tuesday': -0.0008084805653710235,
 'Day_of_Week_Wednesday': -0.0012176678445229678,
 'Hour_Created_14': -0.0012176678445229678,
 'Hour_Created_20': 0.0007424028268551229,
 'Hour_Created_22': 0.0004752650176678456,
 'Hour_Created_23': 0.0,
 'Interaction_14_Wednesday': -0.0012176678445229678,
 'Interaction_20_Thursday': 0.0007424028268551229,
 'Interaction_22_Friday': 0.0012837455830388678,
 'Interaction_22_Tuesday': -0.0008084805653710235,
 'Interaction_22_Wednesday': 0.0,
 'Interaction_23_Tuesday': 0.0,
 'Media': -0.0008844522968197866,
 'Type_Photo': -0.0012176678445229708,
 'Type_Video': 0.0012176678445229685}

Of course, the results may not be very interesting here, because I was working with just 6 data points.
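To plug new posts into a model like this, the new rows need exactly the same dummy columns as the training data. The following is a minimal sketch of one way to do that, not the only one. The encode helper and the new_posts frame are made up for illustration, and the interaction feature is left out to keep it short. It also omits the Media column, which in the output above looks like a row index that was read in as a feature.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The six sample rows from the question.
train = pd.DataFrame({
    "Type": ["Video", "Video", "Video", "Video", "Video", "Photo"],
    "Eng_as_%_of_Followers": [0.0136, 0.0163, 0.0163, 0.0196, 0.0179, 0.0087],
    "Hour_Created": [23, 22, 22, 22, 20, 14],
    "Day_of_Week": ["Tuesday", "Wednesday", "Tuesday", "Friday", "Thursday", "Wednesday"],
})
feature_cols = ["Type", "Hour_Created", "Day_of_Week"]

def encode(frame):
    """Hypothetical helper: cast every predictor to str, then dummy-encode."""
    return pd.get_dummies(frame[feature_cols].astype(str), dtype=float)

X_train = encode(train)
reg = LinearRegression().fit(X_train, train["Eng_as_%_of_Followers"])

# Made-up new posts; the second one has an hour and a day never seen in training.
new_posts = pd.DataFrame({
    "Type": ["Video", "Photo"],
    "Hour_Created": [22, 9],
    "Day_of_Week": ["Friday", "Monday"],
})

# Align to the training columns: missing dummies become 0, unseen ones are dropped.
X_new = encode(new_posts).reindex(columns=X_train.columns, fill_value=0.0)
predictions = reg.predict(X_new)
print(predictions)
```

A category the model never saw (hour 9, Monday) ends up with all of its dummies at zero, so its prediction rests only on the features the model does know about.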

Answering your questions

  1. You can find the y-intercept using reg.intercept_

  2. Yes, you can plug in new values for x and get your target variable by using reg.predict(x), where x is your new input, encoded with the same columns as the training data.

  3. Regression done with ols and with sklearn's LinearRegression is one and the same: both fit the model by ordinary least squares, so on the same inputs they should give the same fit. OLS is nothing but a way to solve the optimization problem that linear regression poses.
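As a rough check of point 3, the sketch below fits the same tiny model twice: once with LinearRegression, and once by solving the least squares problem directly with numpy.linalg.lstsq. This stands in for the ols function and is not a call to it. Only the media type is used, with drop_first=True, so that the solution is unique. With the full set of dummies on six rows the problem is underdetermined, and different solvers may report different coefficients.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# The six sample rows from the question, media type only.
df = pd.DataFrame({
    "Type": ["Video", "Video", "Video", "Video", "Video", "Photo"],
    "Eng_as_%_of_Followers": [0.0136, 0.0163, 0.0163, 0.0196, 0.0179, 0.0087],
})

# One dummy column (Type_Video); Photo is the baseline.
X = pd.get_dummies(df[["Type"]], drop_first=True, dtype=float)
y = df["Eng_as_%_of_Followers"].to_numpy()

# Fit with sklearn.
reg = LinearRegression().fit(X, y)

# Solve the same least squares problem by hand: a column of ones for the intercept.
A = np.column_stack([np.ones(len(X)), X.to_numpy()])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

print(reg.intercept_, reg.coef_)
print(beta)
```

Here the intercept is the mean engagement for Photo posts and the single coefficient is the Video minus Photo difference in means, which is also how the dummy coefficients in a larger model are read: relative to the dropped baseline category.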

Hope this helps!
