NaN value when adding lags using LSTM in python

I'm trying to analyze and predict sales based on a dataset. I have already tidied up my data; however, when I try to create lags, the monthly sales lag columns contain NaN values. What does this NaN mean? The tutorial I'm following doesn't have these NaN values, or at least when its author drops the NaN values he still gets some output. In my case, nothing is left when I drop the NaN values.

from __future__ import division
from datetime import datetime, timedelta, date
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

import warnings
warnings.filterwarnings("ignore")

import plotly.offline as pyoff
import plotly.graph_objs as go

import keras
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import Adam 
from keras.callbacks import EarlyStopping
from keras.utils import np_utils
from keras.layers import LSTM
from sklearn.model_selection import KFold, cross_val_score, train_test_split

#initiate plotly
pyoff.init_notebook_mode()

#read data
df = pd.read_csv(r"C:\Users\User\Desktop\UOW\Yr3\FYP\Sample.csv", encoding='latin-1')

df['Order Date'] = pd.to_datetime(df['Order Date'])

df.head(10)

# Drop rows in which every cell is empty (dropna returns a new frame, so assign it back)
df = df.dropna(axis=0, how='all')
df.shape

# Drop unwanted columns:
# Order ID, Segment, Country, City, State, Postal Code, Region, Product ID,
# Category, Sub-Category, Product Name, Discount
df_sales = df.drop(['Order ID', 'Segment', 'Country', 'City', 'State', 'Postal Code', 'Region', 'Product ID', 'Category', 'Sub-Category', 'Product Name','Discount'], axis = 1)
df_sales.head(10)

# represent month in date field as its first day (the day is fixed to 01)
df_sales['Order Date'] = pd.to_datetime(df_sales['Order Date']).dt.strftime("%Y-%m-01")
df_sales = df_sales.groupby('Order Date').Sales.sum().reset_index()
df_sales

#plot monthly sales
plot_data = [
    go.Scatter(
        x=df_sales['Order Date'],
        y=df_sales['Sales'],
    )
]
plot_layout = go.Layout(
    title='Monthly Sales'
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)

# Create a new dataframe to model the difference
df_diff = df_sales.copy()

# Add previous sales to the next row
df_diff['Prev_Sales'] = df_diff['Sales'].shift(1)

# Drop the null values and calculate the difference
df_diff = df_diff.dropna()
df_diff['diff'] = (df_diff['Sales'] - df_diff['Prev_Sales'])

df_diff.head(10)

#plot sales diff
plot_data = [
    go.Scatter(
        x=df_diff['Order Date'],
        y=df_diff['diff'],)]

plot_layout = go.Layout(
        title='Monthly Sales Difference')

fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)

#create dataframe for transformation from time series to supervised
df_supervised = df_diff.drop(['Prev_Sales'],axis=1)

#adding lags
for inc in range(1,13):
    field_name = 'lag_' + str(inc)
    df_supervised[field_name] = df_supervised['diff'].shift(inc)

#drop null values
#df_supervised = df_supervised.dropna().reset_index(drop=True)

df_supervised

The output I get is:

Order Date | Sales | diff | lag_1 | lag_2 | lag_3 | lag_4 | lag_5 | lag_6 | lag_7 | lag_8 | lag_9 | lag_10 | lag_11 | lag_12
1 2019-02-01 | 333904.9556 | -30136.6174 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
2 2019-03-01 | 361431.8218 | 27526.8662 | -30136.6174 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
3 2019-04-01 | 359930.1225 | -1501.6993 | 27526.8662 | -30136.6174 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
4 2019-05-01 | 348999.4696 | -10930.6529 | -1501.6993 | 27526.8662 | -30136.6174 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
5 2019-06-01 | 372904.5441 | 23905.0745 | -10930.6529 | -1501.6993 | 27526.8662 | -30136.6174 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
6 2019-07-01 | 372936.2013 | 31.6572 | 23905.0745 | -10930.6529 | -1501.6993 | 27526.8662 | -30136.6174 | NaN | NaN | NaN | NaN | NaN | NaN | NaN
7 2019-08-01 | 328648.3505 | -44287.8508 | 31.6572 | 23905.0745 | -10930.6529 | -1501.6993 | 27526.8662 | -30136.6174 | NaN | NaN | NaN | NaN | NaN | NaN
8 2019-09-01 | 371825.2898 | 43176.9393 | -44287.8508 | 31.6572 | 23905.0745 | -10930.6529 | -1501.6993 | 27526.8662 | -30136.6174 | NaN | NaN | NaN | NaN | NaN
9 2019-10-01 | 363781.0459 | -8044.2439 | 43176.9393 | -44287.8508 | 31.6572 | 23905.0745 | -10930.6529 | -1501.6993 | 27526.8662 | -30136.6174 | NaN | NaN | NaN | NaN
10 2019-11-01 | 336836.8240 | -26944.2219 | -8044.2439 | 43176.9393 | -44287.8508 | 31.6572 | 23905.0745 | -10930.6529 | -1501.6993 | 27526.8662 | -30136.6174 | NaN | NaN | NaN
11 2019-12-01 | 374106.0722 | 37269.2482 | -26944.2219 | -8044.2439 | 43176.9393 | -44287.8508 | 31.6572 | 23905.0745 | -10930.6529 | -1501.6993 | 27526.8662 | -30136.6174 | NaN | NaN

If I uncomment this line: df_supervised = df_supervised.dropna().reset_index(drop=True), the output shows nothing but the column headers:

Order Date | Sales | diff | lag_1 | lag_2 | lag_3 | lag_4 | lag_5 | lag_6 | lag_7 | lag_8 | lag_9 | lag_10 | lag_11 | lag_12

Can anyone help me with this issue? Thank you so much!

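NaN stands for "Not a Number"; pandas uses it to mark missing values.

NaNs are expected when you create lags: shift(n) leaves the first n rows of the new column empty, because those rows have no earlier value to copy.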
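
To see where the NaNs come from, here is a minimal sketch that rebuilds the lag columns from the eleven diff values shown in the question's output (the one-column frame is a stand-in for the asker's data, not the original file). With 11 rows and lags up to 12, every row contains at least one NaN, which would explain why dropna() leaves only the column headers.

```python
import pandas as pd

# The eleven monthly differences shown in the question's output
diff = [-30136.6174, 27526.8662, -1501.6993, -10930.6529, 23905.0745,
        31.6572, -44287.8508, 43176.9393, -8044.2439, -26944.2219,
        37269.2482]
df_supervised = pd.DataFrame({"diff": diff})

# Same lag loop as in the question
for inc in range(1, 13):
    df_supervised["lag_" + str(inc)] = df_supervised["diff"].shift(inc)

# Number of NaNs in each column: lag_k has min(k, number of rows) of them
nan_per_column = df_supervised.isna().sum()

# Number of rows that survive dropna(), i.e. rows without any NaN
complete_rows = len(df_supervised.dropna())

print(nan_per_column)
print("rows left after dropna():", complete_rows)
```

Counting NaNs per column with isna().sum() before calling dropna() is a quick way to check whether any complete rows exist at all.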

If you want to keep your rows, fill the NaNs instead of dropping them.

For example: df = df.fillna(0)

You can start by having a look here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
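
As a rough sketch of the two ways to end up with usable rows: the small frame below is made-up illustration data, and the names add_lags, filled, n_lags and trimmed are hypothetical helpers, not part of the question's code. Filling keeps every row but treats missing history as a zero difference, which may or may not suit the model. Using fewer lags than there are rows is an alternative that leaves some complete rows for dropna().

```python
import pandas as pd

# Hypothetical illustration data: six periods of sales differences
df = pd.DataFrame({"diff": [5.0, -3.0, 2.0, 7.0, -1.0, 4.0]})


def add_lags(frame, n_lags):
    """Return a copy of frame with lag_1..lag_<n_lags> columns built from 'diff'."""
    out = frame.copy()
    for inc in range(1, n_lags + 1):
        out["lag_" + str(inc)] = out["diff"].shift(inc)
    return out


# Option 1: keep all rows and replace the missing lag values with 0
filled = add_lags(df, 12).fillna(0)

# Option 2: use fewer lags than rows, then drop the incomplete rows
n_lags = 3
trimmed = add_lags(df, n_lags).dropna().reset_index(drop=True)

print(filled.shape)
print(trimmed)
```

With six rows and three lags, the first three rows are incomplete and are dropped, so three rows remain in trimmed.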
