
Linear Regression Model for Quote Data

I would like to build a linear regression model to determine the influence of various parameters on quote prices. The quote data were collected over 10 years.

[Figure: density plot of quotes over the past 10 years]

y = Price

X = [system size (int), ZIP, year, module_manufacturer, module_name, inverter_manufacturer, inverter_name, battery storage (binary), number of installers/offerers in the region (int), installer_density, new_construction (binary), self_installation (binary), household density]

Questions:

  1. What type of regression model is suitable for this dataset?
  2. Due to technological progress, quote prices decrease over the years. How can I account for the different years in the model? I found some examples where years were treated as binary variables. Another option: a separate regression model for each year. Is there a way to combine these multiple models?
  3. Is the dataset a type of panel data?

Unfortunately, I have not yet found any information that explicitly helps me with my data. But maybe I didn't use the right search terms. I would be very happy about any suggestions that nudge me in the right direction.

Suppose you have a data.frame called data with columns price, system_size, zip, year, battery_storage, etc. Then you can start with a simple linear regression:

lm(price ~ system_size + zip + year + battery_storage, data = data)
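The question also mentions treating years as binary (dummy) variables. Here is a minimal sketch of that variant, run on simulated data because the real quotes are not available. Wrapping year in factor() makes lm() estimate a separate price shift for each year relative to the first one, instead of a single linear trend. The data frame name quotes and all numbers in the simulated price process are assumptions for illustration only.

```r
set.seed(1)

# Simulated stand-in for the quote data (all values are made up)
n <- 500
quotes <- data.frame(
  system_size     = round(runif(n, 3, 15)),
  year            = sample(2013:2022, n, replace = TRUE),
  battery_storage = rbinom(n, 1, 0.3)
)
# Assumed price process: falls by 300 per year, rises with size and battery
quotes$price <- 20000 + 1500 * quotes$system_size -
  300 * (quotes$year - 2013) + 4000 * quotes$battery_storage +
  rnorm(n, sd = 1000)

# Variant A: year as a number, i.e. one common slope per year
fit_trend <- lm(price ~ system_size + year + battery_storage, data = quotes)

# Variant B: year as a factor, i.e. one dummy coefficient per year
# (the first year in the data is the baseline)
fit_dummy <- lm(price ~ system_size + factor(year) + battery_storage,
                data = quotes)

# The dummy model contains the trend model as a special case,
# so an F-test can compare the two fits
comparison <- anova(fit_trend, fit_dummy)
print(coef(fit_trend)["year"])
print(comparison)
```

If the F-test shows no clear gain from the dummies, the single numeric year term is probably enough. Otherwise the per-year coefficients give one combined model instead of a separate regression for each year.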

year is included in the model, so changes over time are taken into account. If you want to remove batch effects (e.g. differences between ZIP code regions) and only care about modeling the price after removing the effect of location, you can fit a linear mixed model with a random intercept per ZIP code:

lmerTest::lmer(price ~ system_size + year + battery_storage + (1|zip), data = data)
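To see what the random intercept does, here is a rough sketch on simulated data. It assumes 20 hypothetical ZIP regions that each shift the price level by a fixed amount. It calls lme4::lmer directly (lmerTest wraps the same function and adds p-values) and skips the fit if lme4 is not installed. The names quotes, zip_ids and zip_shift are illustrative, not taken from the real data.

```r
set.seed(2)

# Simulated stand-in data: 20 hypothetical ZIP regions, each with its own price level
n_zip <- 20
n <- 600
zip_ids <- sprintf("Z%02d", seq_len(n_zip))
zip_shift <- rnorm(n_zip, sd = 2000)   # assumed region effects
names(zip_shift) <- zip_ids

quotes <- data.frame(
  zip             = factor(sample(zip_ids, n, replace = TRUE), levels = zip_ids),
  system_size     = round(runif(n, 3, 15)),
  year            = sample(2013:2022, n, replace = TRUE),
  battery_storage = rbinom(n, 1, 0.3)
)
quotes$price <- 20000 + 1500 * quotes$system_size -
  300 * (quotes$year - 2013) + 4000 * quotes$battery_storage +
  unname(zip_shift[as.character(quotes$zip)]) + rnorm(n, sd = 1000)

if (requireNamespace("lme4", quietly = TRUE)) {
  # (1 | zip) gives every ZIP region its own intercept
  fit_mixed <- lme4::lmer(
    price ~ system_size + year + battery_storage + (1 | zip),
    data = quotes
  )
  print(lme4::fixef(fit_mixed))            # effects shared across all regions
  print(head(lme4::ranef(fit_mixed)$zip))  # estimated shift per ZIP region
} else {
  message("Package 'lme4' is not installed; skipping the mixed model.")
}
```

The fixed effects describe how size, year and battery storage relate to price across all regions, while the per-ZIP intercepts absorb the regional price level.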

If two variables are highly correlated, e.g. year and system_size, you might want to include interaction terms such as year:system_size in your formula. As a rule of thumb, you need at least 10 samples per variable to get a reasonable fit. If you have more variables than that allows, you can do variable selection first.
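As a sketch of these two suggestions, the following simulates data in which the price per unit of system size falls over the years, fits a model with a year:system_size interaction, and then runs backward selection with stats::step(). Selection by AIC is only one possible approach and is an assumption here, as are the noise_var column and the numbers in the simulated price process.

```r
set.seed(3)

n <- 400
quotes <- data.frame(
  system_size     = round(runif(n, 3, 15)),
  year            = sample(2013:2022, n, replace = TRUE),
  battery_storage = rbinom(n, 1, 0.3),
  noise_var       = rnorm(n)   # hypothetical predictor unrelated to price
)
# Assumed process: the price per unit of size drops by 100 each year
quotes$price <- 5000 +
  (2000 - 100 * (quotes$year - 2013)) * quotes$system_size +
  4000 * quotes$battery_storage + rnorm(n, sd = 1000)

# year * system_size expands to year + system_size + year:system_size
fit_full <- lm(price ~ year * system_size + battery_storage + noise_var,
               data = quotes)

# Backward variable selection by AIC (one of several possible criteria)
fit_selected <- step(fit_full, direction = "backward", trace = 0)

# Rule-of-thumb check: samples available per estimated predictor term
n_predictors <- length(coef(fit_full)) - 1
samples_per_predictor <- nrow(quotes) / n_predictors

print(formula(fit_selected))
print(samples_per_predictor)
```

With categorical predictors such as manufacturer or module names, every factor level adds a coefficient, so the count of predictor terms can grow much faster than the number of columns suggests.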

