
Python SKLearn: 'Bad input shape' error when predicting a sequence

I have an Excel file that stores one sequence per column (read from the top cell down), and each sequence follows a trend similar to the column before it. I'd like to predict the sequence for the nth column of this dataset.

A sample of my data set:

[image: sample data]

Each column holds one sequence of values, and the sequences progress gradually from left to right, so I want to predict, for example, the values in column Z.

Here's my code so far:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Read the Excel file in rows
df = pd.read_excel(open('vec_sol2.xlsx', 'rb'),
                header=None, sheet_name='Sheet1')
print(type(df))
length = len(df.columns)
# Get the sequence for each row

x_train, x_test, y_train, y_test = train_test_split(
    np.reshape(range(0, length - 1), (-1, 1)), df, test_size=0.25, random_state=0)

print("y_train shape: ", y_train.shape)

pred_model = LogisticRegression()
pred_model.fit(x_train, y_train)
print(pred_model)

I'll explain the logic as much as possible:

  • x_train and x_test will just be the index / column number that is associated with a sequence.
  • y_train is an array of sequences.
  • There are 51 columns in total, and splitting with 25% as test data results in 37 train sequences and 13 test sequences.

While debugging, I found the shape of each variable:

  • x_train : (37, 1)
  • x_test : (13, 1)
  • y_train : (37, 51)
  • y_test : (13, 51)
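These shapes can be reproduced without the spreadsheet. The following is a minimal sketch, assuming a synthetic 50-row by 51-column DataFrame as a stand-in for vec_sol2.xlsx (the random values are made up for illustration). The 50 rows are inferred from 37 + 13 and from the index array, which has length - 1 = 50 entries.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for vec_sol2.xlsx: 50 rows x 51 columns of random values.
df = pd.DataFrame(np.random.default_rng(0).random((50, 51)))
length = len(df.columns)  # 51

# Same index array as in the question: 50 entries, one per row of df.
idx = np.reshape(range(0, length - 1), (-1, 1))

x_train, x_test, y_train, y_test = train_test_split(
    idx, df, test_size=0.25, random_state=0)

print(x_train.shape, x_test.shape)  # (37, 1) (13, 1)
print(y_train.shape, y_test.shape)  # (37, 51) (13, 51)
```

Note that train_test_split splits along rows, so each "sequence" in y_train here is a row of df with 51 values.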

But right now, running the program gives me this error:

ValueError: bad input shape (37, 51)

What is my mistake here?

I don't understand why you are using this:

x_train, x_test, y_train, y_test = train_test_split(
    np.reshape(range(0, length - 1), (-1, 1)), df, test_size=0.25, random_state=0)

Your data is already in df. Extract X and y from it, and then split them into train and test sets.

Try this:

X = df.iloc[:, :-1]  # features: every column except the last
y = df.iloc[:, -1]   # target: the last column, as a 1-D Series

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
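To see what this split produces, here is a self-contained sketch. It assumes a synthetic 50-row by 51-column DataFrame in place of the real spreadsheet (the values are random and purely illustrative), and it takes the last column as a 1-D Series so that the target has the shape scikit-learn estimators expect.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the spreadsheet: 50 rows x 51 columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((50, 51)))

X = df.iloc[:, :-1]  # every column except the last -> shape (50, 50)
y = df.iloc[:, -1]   # last column as a 1-D Series  -> shape (50,)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

print(X_train.shape, X_test.shape)  # (40, 50) (10, 50)
print(y_train.shape, y_test.shape)  # (40,) (10,)
```

Each row is now one sample with 50 features and a single target value, which is the layout fit() expects.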

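As it stands, the shapes you shared show that you are trying to predict a 51-column output from a single feature. LogisticRegression expects y to be a 1-D array of class labels, one per sample, which is why a (37, 51) target is rejected as a bad input shape.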
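The failure can be reproduced in isolation. This is a minimal sketch, assuming made-up 0/1 labels in place of the real data: fitting against a (37, 51) target raises a ValueError, while a single column passed as a 1-D target fits. The exact wording of the error message may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic arrays with the shapes from the question (values are illustrative).
x_train = np.arange(37).reshape(-1, 1)                    # (37, 1)
labels = np.arange(37) % 2                                # alternating 0/1 labels
y_train = np.repeat(labels[:, None], 51, axis=1)          # (37, 51)

# A 2-D target with 51 columns is rejected by fit().
try:
    LogisticRegression().fit(x_train, y_train)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# A single column, taken as a 1-D array, is accepted.
y_1d = y_train[:, -1]                                     # (37,)
model = LogisticRegression().fit(x_train, y_1d)
print(model.classes_)
```

Also keep in mind that LogisticRegression is a classifier, so the target must contain discrete class labels rather than continuous values.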
