
ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required by MinMaxScaler

I am a beginner in ML. I am helping my Math-major friend create a stock predictor with TensorFlow based on a .csv file he provided.

There are a few problems. The first is his .csv file: each row contains a date and a closing value that are not separated, so I had to split them apart manually. I've managed that, and now I'm having trouble with MinMaxScaler(). I was told I could essentially disregard the dates, work only with the closing values, normalize them, and make a prediction based on them.

I keep getting this error:

ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a
minimum of 1 is required by MinMaxScaler()

I honestly have never used scikit-learn or TensorFlow before, and this is my first time working on such a project. All the guides I see on the topic use pandas, but in my case the .csv file is a mess and I don't believe I can use pandas for it.

I'm following this DataCamp tutorial:

But unfortunately, due to my lack of experience, some things are not really working for me, and I would appreciate a little more clarity on how I should proceed in my case.

Attached below is my (messy) code:

import pandas as pd
import numpy as np
import tensorflow as tf
import sklearn
from sklearn.model_selection import KFold
from sklearn.preprocessing import scale
from sklearn.preprocessing import MinMaxScaler
import matplotlib
import matplotlib.pyplot as plt
from dateutil.parser import parse
from datetime import datetime, timedelta
from collections import deque

stock_data = []
stock_date = []
stock_value = []
f = open("s&p500closing.csv","r")
data = f.read()
rows = data.split("\n")
rows_noheader = rows[1:len(rows)]

#Separating values from the messy .csv, putting each value into its own list plus a combined list of both
for row in rows_noheader:
    [date, value] = row[1:len(row)-1].split('\t')
    stock_date.append(date)
    stock_value.append((value))
    stock_data.append((date, value))

#Numpy array of all closing values converted to floats and normalized against the maximum
stock_value = np.array(stock_value, dtype=np.float32)
normvalue = [i/max(stock_value) for i in stock_value]

#Number of closing values and days. Since there is one closing value for each, they both match and there are 4528 of them (each)
nclose_and_days = 0
for i in range(len(stock_data)):
    nclose_and_days+=1

train_data = stock_value[:2264]
test_data = stock_value[2264:]

scaler = MinMaxScaler()

train_data = train_data.reshape(-1,1)
test_data = test_data.reshape(-1,1)

# Train the Scaler with training data and smooth data
smoothing_window_size = 1100
for di in range(0,4400,smoothing_window_size):
    #error occurs here
    scaler.fit(train_data[di:di+smoothing_window_size,:])
    train_data[di:di+smoothing_window_size,:] = scaler.transform(train_data[di:di+smoothing_window_size,:])

# You normalize the last bit of remaining data
scaler.fit(train_data[di+smoothing_window_size:,:])
train_data[di+smoothing_window_size:,:] = scaler.transform(train_data[di+smoothing_window_size:,:])

# Reshape both train and test data
train_data = train_data.reshape(-1)

# Normalize test data
test_data = scaler.transform(test_data).reshape(-1)

# Now perform exponential moving average smoothing
# So the data will have a smoother curve than the original ragged data
EMA = 0.0
gamma = 0.1
for ti in range(1100):
    EMA = gamma*train_data[ti] + (1-gamma)*EMA
    train_data[ti] = EMA

# Used for visualization and test purposes
all_mid_data = np.concatenate([train_data,test_data],axis=0)

window_size = 100
N = train_data.size
std_avg_predictions = []
std_avg_x = []
mse_errors = []

for pred_idx in range(window_size,N):
    std_avg_predictions.append(np.mean(train_data[pred_idx-window_size:pred_idx]))
    mse_errors.append((std_avg_predictions[-1]-train_data[pred_idx])**2)
    std_avg_x.append(date)

print('MSE error for standard averaging: %.5f'%(0.5*np.mean(mse_errors)))

I know that this post is old, but as I stumbled here, others will too. After running into the same problem and googling quite a bit, I found this post: https://github.com/llSourcell/Make_Money_with_Tensorflow_2.0/issues/7

It seems that if you download a dataset that is too small, it will throw that error. Download a .csv going back to 1962 and it'll be big enough ;).

Now I just have to find the right parameters for my dataset, as I'm adapting this to another type of prediction. Hope it helps.

Your issue isn't your CSV or pandas. You can actually read the CSV with pandas straight into a DataFrame, which is what I recommend you do: df = pd.read_csv(path)
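For what it's worth, a minimal sketch of that, assuming the file is tab-separated with a header row (the separator and column names are guesses based on the split('\t') in your parsing code, so adjust them to whatever the real file contains):

import pandas as pd

# Assumed layout: a header row, then one tab-separated date and closing value per line
df = pd.read_csv("s&p500closing.csv", sep="\t", header=0, names=["date", "close"])
stock_value = df["close"].astype("float32").to_numpy()

Even if the file is messier than that, read_csv's sep, skiprows and usecols arguments cover most of the cleanup that was done by hand above.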

I am having the same issue with the same code. What's happening is that scaler = MinMaxScaler(), and then in the for di in range(...) part you fit the scaler on a chunk of the training set, transform it, and reassign it back to itself.

The problem is that the loop keeps asking for more data from your training set to fit the scaler on, and it runs out of data. Which is odd, given the way the tutorial you are following presented it.

The train_data variable has a length of 2264:

train_data = stock_value[:2264]

Then, when you go to fit the scaler, the window runs past the end of train_data on the third iteration of the for loop, and by the fourth iteration (di = 3300) the slice is completely empty:

smoothing_window_size = 1100
for di in range(0, 4400, smoothing_window_size):

Notice the size of the data set in the tutorial: the training and testing chunks each have length 11,000 and the smoothing_window_size is 2500, so the window never exceeds train_data's boundaries.
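One way to stay inside those bounds regardless of dataset size is to derive the loop range from the actual length of train_data instead of hard-coding 4400; a rough sketch, keeping the tutorial's chunked fit/transform structure:

smoothing_window_size = 1100
n_train = train_data.shape[0]  # 2264 here, not 4400

# Fit and transform one window at a time, never stepping past the end of train_data
for di in range(0, n_train, smoothing_window_size):
    chunk = train_data[di:di + smoothing_window_size, :]
    scaler.fit(chunk)
    train_data[di:di + smoothing_window_size, :] = scaler.transform(chunk)

Because the range stops at n_train, the last chunk may be shorter than the window but is never empty, so MinMaxScaler always has at least one sample to fit.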

You have a column of all 0's in your data. If you try to scale it, MinMaxScaler can't assign a scale and it trips up. You need to filter out empty/NaN columns before you scale the data. Try:

    stock_value = stock_value[:, ~np.all(np.isnan(stock_value), axis=0)]

to filter out the NaN columns in your data.

I see that your window is 1100 and in your for loop you go from 0 to 4400 in steps of 1100. That division leaves a remainder of 0, which in turn leaves 0 items to normalize, so the code that reads

# You normalize the last bit of remaining data
scaler.fit(train_data[di+smoothing_window_size:,:])
train_data[di+smoothing_window_size:,:] = scaler.transform(train_data[di+smoothing_window_size:,:])

is not needed; just comment those lines out. It should work after that.
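If you would rather keep that remainder step than delete it, an emptiness check achieves the same thing; a small sketch of that idea, not the tutorial's own code:

# Only fit/transform the leftover slice if it actually contains rows
leftover = train_data[di + smoothing_window_size:, :]
if leftover.shape[0] > 0:
    scaler.fit(leftover)
    train_data[di + smoothing_window_size:, :] = scaler.transform(leftover)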

At the top of my code I wrote %reset (the IPython magic) to clear the workspace; it wiped the memory and got rid of the 'found array with 0 feature(s)' error for me.

#fitting multiple linear regression to the training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)  # <-- the error occurred here earlier

I've got to apologize: the whole time you guys were trying to figure out a solution to my issue, I ended up finding a decent guide and taking a much less sophisticated approach (this was my first ever taste of AI and statistics). The funny thing is, I had been breaking my head over this for months, until I went to a conference in Florida last November and ended up finishing it in less than two hours, at 3 am, in my hotel room.

Here is the finished code I wrote back then and ended up presenting to my colleague as a working example:

import tensorflow as tf
from keras import backend as K

from tensorflow.python.saved_model import builder as saved_model_builder
from tensorflow.python.saved_model import tag_constants, signature_constants, signature_def_utils_impl

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD
import numpy as np

import matplotlib.pyplot as plt


stock_data = []
stock_date = []
stock_value = []
f = open("s&p500closing.csv","r")
data = f.read()
rows = data.split("\n")
rows_noheader = rows[1:len(rows)]

#Separating values from the messy CSV, putting each value into its own list plus a combined list of both
for row in rows_noheader:
    [date, value] = row[1:len(row)-1].split('\t')
    stock_date.append(date)
    stock_value.append((value))
    stock_data.append((date, value))

#Making an array of arrays ready for use with TF,
#slicing array of data to smaller train data
#and normalizing the values against the max for training    
stock_value = np.array(stock_value, dtype=np.float32)
normvalue = [i/max(stock_value) for i in stock_value]
normvalue = np.array(normvalue)
train_data = [np.array(i) for i in normvalue[:500]]
train_data = np.array(train_data)
train_labels = train_data

#First plotting the actual values
plt.plot(normvalue)

#Creating TF session
sess = tf.Session()
K.set_session(sess)
K.set_learning_phase(0)

model_version = "2"
#Declaring the number of epochs, i.e. how many passes over the training data the model makes
#(can play around with it)
epoch = 20
#Building the model
####################
model = Sequential()
model.add(Dense(8, input_dim=1))
model.add(Activation('tanh'))
model.add(Dense(1))
model.add(Activation('sigmoid'))
sgd = SGD(lr=0.1)

#Compiling and fitting our data to the model
model.compile(loss='binary_crossentropy', optimizer=sgd)
model.fit(train_data, train_labels, batch_size=1, nb_epoch=epoch)

#Declaring variables for the model's input and output to make sure they are all valid
x = model.input
y = model.output

prediction_signature = tf.saved_model.signature_def_utils.predict_signature_def({"inputs": x}, {"prediction":y})

valid_prediction_signature = tf.saved_model.signature_def_utils.is_valid_signature(prediction_signature)
if(valid_prediction_signature == False):
    raise ValueError("Error: Prediction signature not valid!")

#Here the actual prediction of the real values occurs
predictions = model.predict(normvalue)

#Plotting the prediction values
plt.xlabel("Blue: Actual            Orange: Prediction")    
plt.plot(predictions)

Please feel free to make changes and experiment with it as you see fit. I would like to thank you all for taking the time to examine my issue and provide a variety of solutions, and I am looking forward to learning more in the future :)

I ran into the same error message while writing a unit test for a text-classification package based on logistic regression, and I realized it was due to trying to apply a model to an empty DataFrame.

In my case this could happen because my model was actually a tree of models mirroring the category tree in my (huge) training data, but only a few of those sub-cases actually occurred in the tiny test DataFrame in my unit test.

Long story short: I think the error probably happens because at some point

train_data[di:di+smoothing_window_size,:]

ends up having length 0 in your loop.
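A quick way to confirm that is to print each chunk's shape just before fitting; a throwaway debugging snippet, not part of the tutorial:

for di in range(0, 4400, smoothing_window_size):
    chunk = train_data[di:di + smoothing_window_size, :]
    print(di, chunk.shape)  # an empty chunk shows up as (0, 1), exactly what MinMaxScaler rejects
    scaler.fit(chunk)
    train_data[di:di + smoothing_window_size, :] = scaler.transform(chunk)

With train_data of length 2264, the printout shows the slice going empty at di = 3300.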

This is my first time commenting on Stack Overflow, so if you find errors in how I answer, or any mistakes, please correct me.

To keep the calculation simple for the error above, here is how you can avoid that ValueError:

mid_prices = (high_prices+low_prices)/2.0
print(len(mid_prices))#length 2024

train_data = mid_prices[:1012]
test_data = mid_prices[1012:]

scaler = MinMaxScaler()
train_data = train_data.reshape(-1,1)
test_data = test_data.reshape(-1,1)

smoothing_window_size = 200

for di in range (0,1000,smoothing_window_size):
    scaler.fit(train_data[di:di+smoothing_window_size,:])
    train_data[di:di+smoothing_window_size,:] = scaler.transform(train_data[di:di+smoothing_window_size,:])

The code above works: my mid_prices variable has a length of 2024, so my

train_data = mid_prices[:1012]
test_data = mid_prices[1012:]

is split into two chunks of 1012 each.

Now, if you look at the code the tutorial provided:

his total size is 22,000. He splits it into two 11,000-point chunks for train and test, and then for the scaler he uses a range from 0 to 10,000 in the for loop with a smoothing_window_size of 2500, which (please correct me if I am wrong) makes four passes through that 10k slice plus the separate fit on the remaining data.

Using the same logic as the author, I did this:

smoothing_window_size = 200

for di in range(0,1000,smoothing_window_size):
    scaler.fit(train_data[di:di+smoothing_window_size,:])
    train_data[di:di+smoothing_window_size,:] = scaler.transform(train_data[di:di+smoothing_window_size,:])

This worked perfectly with my set of data and with the example provided.

I hope this answer helps to solve the issue.

I had an almost identical exception, only mine read:

"ValueError: Found array with 0 sample(s) (shape=(0, 2)) while a minimum of 1 is required by StandardScaler."

The shape in the message describes the structure of the data the scaler received. I realized that my code was not finding any data at all to analyse and work on (hence the "0" rows in the shape). So I simply pointed it at the directory where my dataset actually resides, and voila, everything worked as expected. I used the glob() function to point to my directory and select the files of interest to be analyzed, in my case audio .wav files. Below is the line of code I had to correct:

for wav_file in glob.glob("/home/directory_1/directory_2/*.wav"):
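In the same spirit, it can help to fail fast when the pattern matches nothing, so an empty dataset is caught before it ever reaches the scaler (a small sketch; the path and the feature-extraction step are placeholders):

import glob

wav_files = glob.glob("/home/directory_1/directory_2/*.wav")
if not wav_files:
    raise FileNotFoundError("No .wav files matched; check the directory path")
for wav_file in wav_files:
    pass  # extract features from each file here, as before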
