
Advisable ways to shape my data as input for an RNN

I have a dataframe X, where each row is a data point in time and each column is a feature. The label/target variable Y is univariate. One of the columns of X is the lagged values of Y.

The RNN input has the shape (batch_size, n_timesteps, n_features).

From what I've been reading on this site, batch_size should be as big as possible without running out of memory. My main doubt is about n_timesteps and n_features.

I think n_features is the number of columns in the X dataframe.

What about the n_timesteps?

Consider the following dataframe with the features temperature, pressure, and humidity:

import pandas as pd
import numpy as np

# 20 rows (one per point in time) and 3 feature columns of random values
X = pd.DataFrame(data={
    'temperature': np.random.random(20),
    'pressure': np.random.random(20),
    'humidity': np.random.random(20),
})

# to_markdown() requires the optional 'tabulate' package
print(X.to_markdown())
|    |   temperature |   pressure |   humidity |
|---:|--------------:|-----------:|-----------:|
|  0 |     0.205905  |  0.0824903 | 0.629692   |
|  1 |     0.280732  |  0.107473  | 0.588672   |
|  2 |     0.0113955 |  0.746447  | 0.156373   |
|  3 |     0.205553  |  0.957509  | 0.184099   |
|  4 |     0.741808  |  0.689842  | 0.0891679  |
|  5 |     0.408923  |  0.0685223 | 0.317061   |
|  6 |     0.678908  |  0.064342  | 0.219736   |
|  7 |     0.600087  |  0.369806  | 0.632653   |
|  8 |     0.944992  |  0.552085  | 0.31689    |
|  9 |     0.183584  |  0.102664  | 0.545828   |
| 10 |     0.391229  |  0.839631  | 0.00644447 |
| 11 |     0.317618  |  0.288042  | 0.796232   |
| 12 |     0.789993  |  0.938448  | 0.568106   |
| 13 |     0.0615843 |  0.704498  | 0.0554465  |
| 14 |     0.172264  |  0.615129  | 0.633329   |
| 15 |     0.162544  |  0.439882  | 0.0185174  |
| 16 |     0.48592   |  0.280436  | 0.550733   |
| 17 |     0.0370098 |  0.790943  | 0.592646   |
| 18 |     0.371475  |  0.976977  | 0.460522   |
| 19 |     0.493215  |  0.381539  | 0.995716   |

Now, if you want to use this kind of data for time series prediction with an RNN model, you usually treat one row of the dataframe as one timestep. Converting the dataframe into an array might also help you understand what the timesteps are:

print(np.expand_dims(X.to_numpy(), axis=1).shape)
# (20, 1, 3)

First, X.to_numpy() gives an array of shape (20, 3): 20 samples, each with three features. Adding an explicit time dimension with np.expand_dims then gives the shape (20, 1, 3), meaning that the data set consists of 20 samples, each sample has one timestep, and each timestep has 3 features. You can now use this data directly as input for an RNN.
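With a single timestep per sample, the network only ever sees one row at a time. If you want each sample to carry some history, a common alternative is to group consecutive rows into overlapping windows, so that n_timesteps becomes the window length. The following is only a minimal sketch of that idea: the helper name make_windows, the window length of 5, the random stand-in data, and the choice to pair each window with the target of its last row are all assumptions for illustration, not something taken from your setup.

```python
import numpy as np

def make_windows(features, targets, n_timesteps):
    """Stack overlapping windows of consecutive rows into a 3-D array.

    features: array of shape (n_rows, n_features)
    targets:  array of shape (n_rows,)
    Returns (windows, labels) with shapes
    (n_rows - n_timesteps + 1, n_timesteps, n_features) and
    (n_rows - n_timesteps + 1,).
    """
    windows, labels = [], []
    for start in range(len(features) - n_timesteps + 1):
        end = start + n_timesteps
        windows.append(features[start:end])
        # Assumption: each window is labelled with the target of its last row.
        labels.append(targets[end - 1])
    return np.stack(windows), np.array(labels)

# Stand-in data with the same layout as X above: 20 rows, 3 features.
rng = np.random.default_rng(0)
features = rng.random((20, 3))
targets = rng.random(20)

X_windows, y_windows = make_windows(features, targets, n_timesteps=5)
print(X_windows.shape)  # (16, 5, 3)
print(y_windows.shape)  # (16,)
```

Here the first axis is the number of samples (from which batches are drawn), the second is n_timesteps, and the third is n_features. How long the window should be depends on how far back the useful signal in your data reaches, so it is usually treated as a hyperparameter to tune.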
