
Preprocessing data for Time-Series prediction

Okay, so I am doing research on how to do time-series prediction. As always, preprocessing the data is the difficult part. I understand that I have to convert the "time-stamp" column in a data file into a "datetime" or "timestep", and I did that:

import pandas as pd
df = pd.read_csv("airpassengers.csv")
month = pd.to_datetime(df['Month'])

(I may have parsed the datetime incorrectly; I have seen people parse the dates inside pd.read_csv() instead. If I did it wrong, please advise on how to do it properly.)

I also understand the part where I scale my data. (Could someone explain how the scaling works? I know that it puts all my data within the range I give it, but would the output of my prediction also be scaled?)

Lastly, once I have scaled and parsed the data and timestamps, how would I actually predict with the trained model? I don't know what to pass into (for example) model.predict(). From some research it seems I have to shift my dataset or something, but I don't really understand what the documentation is saying, and its example isn't directly related to time-series prediction.

I know this is a lot, and you might not be able to answer all the questions. I am fairly new to this. Just help with whatever you can. Thank you!

So, because you're working with airpassengers.csv and asking about predictive modeling, I'm going to assume you're working through this GitHub repository.

There are a couple of things I want to make sure you know before I dive into the answers to your questions.

  • There are lots of different types of predictive models used in forecasting. You can find all about them here.
  • You're asking a lot of broad questions, but I'll break the main ones down into two steps and describe what's happening using the example that I believe you're trying to replicate.

Let's break it down.

Loading and parsing the data

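import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams

# Jupyter/IPython magic: render plots inline (remove when running as a plain script)
%matplotlib inline
rcParams['figure.figsize'] = 15, 6  # default figure size in inches

# Read the CSV, parse the first column as dates, rename the columns,
# and use the date column as the index
air_passengers = pd.read_csv("./data/AirPassengers.csv", header=0, parse_dates=[0], names=['Month', 'Passengers'], index_col=0)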

This section of code loads the data from a .csv (comma-separated values) file and saves it into the data frame air_passengers. Inside the call that reads the csv we also state that there is a header in the first row (header = 0), that the first column is full of dates (parse_dates = [0]), what our columns should be named (names), and that the data frame should be indexed by the first column (index_col = 0). Doing the date parsing inside pd.read_csv() replaces the separate pd.to_datetime() call from your question.
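To compare the two ways of parsing the dates, here is a minimal sketch. The inline rows and the name sample_csv are made up for illustration so that it runs without the real file; the column layout is an assumption based on the call above.

```python
import io
import pandas as pd

# Hypothetical sample rows standing in for AirPassengers.csv (illustration only)
sample_csv = "Month,#Passengers\n1949-01,112\n1949-02,118\n1949-03,132\n"

# Option 1: let read_csv parse the first column as dates and use it as the index
parsed = pd.read_csv(io.StringIO(sample_csv), header=0, parse_dates=[0],
                     names=["Month", "Passengers"], index_col=0)

# Option 2: read the column as plain text, then convert it explicitly
raw = pd.read_csv(io.StringIO(sample_csv), header=0, names=["Month", "Passengers"])
raw["Month"] = pd.to_datetime(raw["Month"], format="%Y-%m")
converted = raw.set_index("Month")

# Both routes should end with a DatetimeIndex holding the same timestamps
print(type(parsed.index).__name__, type(converted.index).__name__)
print(list(parsed.index) == list(converted.index))
```

Either route should work; parsing inside read_csv just saves a step.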

Scaling the data

log_air_passengers = np.log(air_passengers.Passengers)

This is done to make the math make sense. A logarithm is the inverse of exponentiation (if 2^y = X, then log2(X) = y). numpy's log function gives us the natural log (base e). Taking logs compresses the larger values, which tends to even out swings that grow as the series grows. Anything you predict from log values is also on the log scale, so convert it back with np.exp to get passenger counts. Once the logs are differenced (the next step), your predicted values will be so close to percent changes that you can use them as such.
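A small sketch of those two points, using made-up passenger counts (the numbers and variable names are assumptions for illustration only): np.exp undoes np.log, and a difference of logs is close to the fractional change when the change is small.

```python
import numpy as np

# Hypothetical passenger counts (illustration only)
passengers = np.array([112.0, 118.0, 132.0, 129.0])

log_passengers = np.log(passengers)  # natural log, as in the answer
restored = np.exp(log_passengers)    # np.exp undoes np.log

# Compare the change between the first two points on both scales
log_change = log_passengers[1] - log_passengers[0]
pct_change = passengers[1] / passengers[0] - 1.0

print(np.allclose(restored, passengers))
print(round(log_change, 4), round(pct_change, 4))
```

The approximation gets worse as the changes get larger, so treat it as a reading aid rather than an exact conversion.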

Now that the data has been scaled, we can prep it for statistical modeling:

log_air_passengers_diff = log_air_passengers - log_air_passengers.shift()
log_air_passengers_diff.dropna(inplace=True)

This changes the series so that each value is the difference between a data point and the one before it, instead of the log value itself. The first entry has no previous point, so it becomes NaN and is dropped.
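Because a model trained on this series predicts differenced log values, its output has to be converted back before it reads as passenger counts. The following is only a sketch of that round trip on made-up data (the values and variable names are assumptions, not part of the original example): add the cumulative changes to the first log value, then apply np.exp.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly values (illustration only)
passengers = pd.Series([112.0, 118.0, 132.0, 129.0, 121.0],
                       index=pd.date_range("1949-01-01", periods=5, freq="MS"))

log_values = np.log(passengers)
# Same operation as in the answer; log_values.diff().dropna() is equivalent
log_diff = (log_values - log_values.shift()).dropna()

# Undo the differencing: start from the first log value and add the cumulative changes
rebuilt_log = log_values.iloc[0] + log_diff.cumsum()
# Undo the log to get back to passenger counts
rebuilt = np.exp(rebuilt_log)

print(rebuilt.round(1))
```

The same two steps would apply to predicted differences: accumulate them from the last known log value, then exponentiate.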

The last part of your question contains too many steps to cover here. It is also not as simple as calling a single function. I encourage you to learn more from here.


 