What is my dataframe's x value when using sklearn RandomForestRegressor?

I'm working on a big data project for school. My dataset looks like this: https://github.com/gindeleo/climate/blob/master/GlobalTemperatures.csv

I'm trying to predict the next values of "LandAverageTemperature".

I asked another question about this topic earlier (it's here: How to predict correctly in sklearn RandomForestRegressor?), but it didn't get any answers. After getting nothing from that first question and then failing for another day, I've decided to start from scratch.

Right now, I want to know which value in my dataset is "x" so that I can make the prediction correctly. I read that y is the dependent variable that I want to predict, and x is the independent variable that I should use as a "predictor" to help the prediction process. In that case my y variable is "LandAverageTemperature". I don't know what the x value is. I was using the date values for x at first, but I'm no longer sure that is correct.

And if I shouldn't use RandomForestRegressor or sklearn for this dataset (I originally started this project with Spark), please let me know. Thanks in advance.

You only have one variable (LandAverageTemperature), so that's what you're going to use. What you're looking for is the df.shift() function, which shifts your values. With this function, you'll be able to add columns of past values to your dataframe. You will then be able to use t 1 month/day ago, t 2 months/days ago, etc., as predictors of another day's or month's temperature.

You can use it like this:

# Add lagged copies of the target as columns T-1 ... T-14
for i in range(1, 15):
    df.loc[:, 'T-%s' % i] = df.loc[:, 'LandAverageTemperature'].shift(i)

Your columns will then be the temperature itself and the temperature at T-1, T-2, and so on, for up to 14 time periods.
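To show how those lagged columns could feed a RandomForestRegressor, here is a minimal sketch. It assumes a synthetic monthly series in place of GlobalTemperatures.csv, and the 24-month hold-out and the model settings are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic monthly temperatures (an assumption standing in for the real CSV).
rng = np.random.default_rng(0)
months = np.arange(240)
df = pd.DataFrame({
    "LandAverageTemperature": 8.0
    + 6.0 * np.sin(2 * np.pi * months / 12)
    + rng.normal(0.0, 0.3, size=months.size)
})

# Add the lagged columns T-1 ... T-14, as in the loop above.
n_lags = 14
for i in range(1, n_lags + 1):
    df["T-%s" % i] = df["LandAverageTemperature"].shift(i)

# The first n_lags rows have no complete history, so drop them.
data = df.dropna()

# X is every lagged column; y is the value being predicted.
X = data.drop(columns="LandAverageTemperature")
y = data["LandAverageTemperature"]

# Split chronologically (no shuffling): train on the past, test on the last 24 months.
X_train, X_test = X.iloc[:-24], X.iloc[-24:]
y_train, y_test = y.iloc[:-24], y.iloc[-24:]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
mae = float(np.mean(np.abs(predictions - y_test.to_numpy())))
print("Mean absolute error on the held-out months:", round(mae, 3))
```

The split is kept in time order so that the model is never trained on months that come after the ones it is tested on.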

As for your question about which model is appropriate for time series forecasting, that would be off-topic for this site, but many resources exist on https://stats.stackexchange.com.

In the general case, you can use all data columns except your target column as the X feature matrix. But in your case there are several complications:

  • You have missing (empty) data in most of the columns for many years. You can exclude such rows/years from the training data. Or you can exclude the columns with missing data (which would be almost all of your columns, so that's not a good option).
  • A regression model can't use date fields directly. You should translate the date field into one or more numerical fields, for example "months since the first observation". Something like (year-1750)*12 + month. And/or you can put year and month in separate columns (this is better if your data has some "seasonality").
  • You have sequential time data here, so maybe you should not use simple regression. Use ARIMA/SARIMA/SARIMAX or similar so-called time-series models, which predict the target sequentially, one value after another (month after month in your case). It's a hard topic to learn, but you should definitely take a look at time series, because you will need it at some point in the future, if not today.
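The first two points above (dropping rows with a missing target and turning the date into numbers) can be sketched roughly as follows. The column name "dt", the temperature values, and the new column name months_since_start are assumptions made up for illustration; only the formula (year-1750)*12 + month comes from the answer.

```python
import pandas as pd

# A few hand-written rows; the "dt" column name and all values are illustrative only.
df = pd.DataFrame({
    "dt": ["1750-01-01", "1750-02-01", "1750-03-01", "1751-01-01"],
    "LandAverageTemperature": [3.0, 3.1, None, 2.5],
})

# Exclude rows where the target is missing.
df = df.dropna(subset=["LandAverageTemperature"]).copy()

# Translate the date into numeric fields a regression model can use.
dates = pd.to_datetime(df["dt"])
df["year"] = dates.dt.year
df["month"] = dates.dt.month  # kept separately so the model can pick up seasonality

# Months elapsed relative to 1750, as in (year-1750)*12 + month.
df["months_since_start"] = (df["year"] - 1750) * 12 + df["month"]

print(df[["dt", "year", "month", "months_since_start"]])
```

The numeric columns year, month, and months_since_start could then be used as part of X in place of the raw date string.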
