
Preprocessing data for Time-Series prediction

Okay, so I am doing research on how to do time-series prediction. As always, preprocessing the data is the difficult part. I understand that I have to convert the "time-stamp" column in the data file into a "datetime" or "timestep", and I did that:

import pandas as pd

df = pd.read_csv("airpassengers.csv")
month = pd.to_datetime(df['Month'])  # convert the 'Month' strings to datetimes

(I may have parsed the datetime incorrectly. I have seen people let pd.read_csv() parse the dates instead. If I did it wrong, please advise on how to do it properly.)

I also understand the part where I scale my data. (Could someone explain how the scaling works? I know that it maps all my data into the range I give it, but would the output of my prediction also be scaled?)

Lastly, once I have scaled the data and parsed the timestamps, how would I actually predict with the trained model? I don't know what to pass to (for example) model.predict(). I did some research and it seems like I have to shift my dataset or something, but I don't really understand what the documentation is saying, and its example isn't directly related to time-series prediction.

I know this is a lot, and you might not be able to answer all the questions. I am fairly new to this. Just help with whatever you can. Thank you!

So, because you're working with airpassengers.csv and asking about predictive modeling, I'm going to assume you're working through this GitHub.

There are a couple of things I want to make sure you know before I dive into the answers to your questions.

  • There are lots of different types of predictive models used in forecasting. You can find all about them here.
  • You're asking a lot of broad questions, but I'll break the main ones down into two steps and describe what's happening using the example that I believe you're trying to replicate.

Let's break it down

Loading and parsing the data

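import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams
%matplotlib inline
rcParams['figure.figsize'] = 15, 6  # default figure size in inches

# %matplotlib inline above is notebook-only magic; drop it in a plain script.
# Treat row 0 as the header, rename the columns, parse column 0 as dates and use it as the index.
air_passengers = pd.read_csv("./data/AirPassengers.csv", header=0, parse_dates=[0],
                             names=['Month', 'Passengers'], index_col=0)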

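This section of code loads the data from a .csv (comma-separated values) file and saves it into the data frame air_passengers. The arguments to read_csv say that the first row is a header (header=0), that the first column holds dates to be parsed (parse_dates=[0]), that the columns should be named Month and Passengers (names), and that the first column becomes the index of the data frame (index_col=0). This does the same job as calling pd.to_datetime on the column after loading, so either approach is fine.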
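To make the comparison with pd.to_datetime concrete, here is a minimal sketch of both ways of parsing the dates. It assumes a tiny made-up sample in the same "YYYY-MM" layout instead of the real file, and the variable names are only illustrative.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for AirPassengers.csv
csv_text = "Month,#Passengers\n1949-01,112\n1949-02,118\n1949-03,132\n"

# Option 1: let read_csv parse the dates and build the index in one go
parsed_on_read = pd.read_csv(
    io.StringIO(csv_text), header=0, names=["Month", "Passengers"],
    parse_dates=[0], index_col=0,
)

# Option 2: read the raw strings first, then convert and index afterwards
raw = pd.read_csv(io.StringIO(csv_text), header=0, names=["Month", "Passengers"])
raw["Month"] = pd.to_datetime(raw["Month"])
parsed_after = raw.set_index("Month")

print(parsed_on_read.index)
print(parsed_after.index)
```

Both frames should end up with a DatetimeIndex holding the same months, so which one you use is mostly a matter of taste.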

Scaling the data

log_air_passengers = np.log(air_passengers.Passengers)

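This is done to make the math better behaved: a log transform is commonly used to even out the variance of a series whose swings grow as its level grows, which is the case here. A logarithm is the inverse of exponentiation (if y = e^x, then ln(y) = x). numpy's log function gives us the natural log (log base e). Because the model is then fitted on log values, its predictions are on the log scale too, and you apply np.exp to get back to passenger counts. A useful side effect is that the difference between two consecutive log values is so close to a percent change that you can read it as one.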
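On the question of whether predictions come out transformed as well: anything fitted on log values also predicts log values, and np.exp undoes np.log. A minimal sketch of that round trip, using made-up numbers rather than the real file:

```python
import numpy as np
import pandas as pd

# Made-up sample values standing in for air_passengers.Passengers
passengers = pd.Series([112.0, 118.0, 132.0, 129.0])

log_passengers = np.log(passengers)  # natural log, as in the answer
restored = np.exp(log_passengers)    # np.exp is the inverse of np.log

print(restored.round(6).tolist())
```

The same np.exp call would be applied to a model's log-scale forecasts before comparing them with real passenger counts.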

Now that the data has been scaled, we can prep it for statistical modeling

log_air_passengers_diff = log_air_passengers - log_air_passengers.shift()
log_air_passengers_diff.dropna(inplace=True)

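This replaces each log value with its difference from the previous one: shift() moves the series down by one row, so the subtraction is "current minus previous". The first row has no previous value and becomes NaN, which is why dropna is called afterwards.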
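A small sketch may help here. It uses made-up values, shows that the log differences sit close to the ordinary percent changes, and shows how the differencing can be undone with a cumulative sum. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up sample values standing in for air_passengers.Passengers
passengers = pd.Series([112.0, 118.0, 132.0, 129.0])
log_passengers = np.log(passengers)

# Same operation as in the answer: current log value minus the previous one
log_diff = (log_passengers - log_passengers.shift()).dropna()

# Ordinary percent change, for comparison with the log differences
pct_change = passengers.pct_change().dropna()
print(pd.DataFrame({"log_diff": log_diff, "pct_change": pct_change}))

# Undo the differencing: add the running total back onto the first log value,
# then exponentiate to return to the original units
rebuilt = np.exp(log_passengers.iloc[0] + log_diff.cumsum())
print(rebuilt.round(6).tolist())
```

The rebuilt values should match the original series from its second entry onward, which is the same route a forecast of differences would take back to passenger counts.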

The last part of your question contains too many steps to cover here. It is also not as simple as calling a single function. I encourage you to learn more from here
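As a very rough illustration of the "shifting" idea mentioned in the question, and not the method used in the linked material, the sketch below turns a series into input/target pairs with shift() and fits a straight line with plain least squares. The data is made up, and the one-lag linear model is an assumption chosen only to keep the example short. A real forecasting model would replace it.

```python
import numpy as np
import pandas as pd

# Made-up differenced values standing in for log_air_passengers_diff
series = pd.Series([0.05, 0.11, -0.02, -0.06, 0.11, 0.09, 0.00, -0.08])

# Supervised framing: the previous value (lag_1) is the input,
# the current value is the target; the first row has no lag and is dropped
frame = pd.DataFrame({"lag_1": series.shift(1), "target": series}).dropna()

# Fit target ~ a * lag_1 + b by ordinary least squares
X = np.column_stack([frame["lag_1"].to_numpy(), np.ones(len(frame))])
(a, b), *_ = np.linalg.lstsq(X, frame["target"].to_numpy(), rcond=None)

# One-step-ahead forecast: the most recent observed value is the input
next_value = a * series.iloc[-1] + b
print(next_value)
```

This is what "what do I pass to predict" usually comes down to: the model takes the most recent lagged values as input, and its output is on the same transformed scale as the data it was trained on.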
