I have vstacked image data and now I wish to split it into a training set and a test set. However, how do I initialize an empty numpy array so I can start vstacking?
My simplified code looks like this:
#k-fold the data
kf = cross_validation.KFold(n, n_folds=2)
fold = 0
for train_ind, test_ind in kf:
    #Get the persons of k-fold
    train_pers = unique[train_ind]
    test_pers = unique[test_ind]
    #Set train+test stack to empty
    self.train_stack = type(self.pca_data[0])
    self.test_stack = type(self.pca_data[0])
    #For all test data
    for data in range(len(self.pca_data)):
        print(self.pca_pers[data])
        if self.pca_pers[data] in train_pers:
            #Add to train stack
            self.train_stack = np.vstack((self.train_stack, self.pca_data[data]))
        elif self.pca_pers[data] in test_pers:
            #Add to test stack
            self.test_stack = np.vstack((self.test_stack, self.pca_data[data]))
        else:
            #Something wrong
            print(data)
            sys.exit("Strange strange data")
    fold += 1
The important code here is:
#Set train+test stack to empty
self.train_stack = type(self.pca_data)
self.test_stack = type(self.pca_data)
and
#Add to train stack
self.train_stack = np.vstack((self.train_stack, self.pca_data[fold][data]))
self.pca_data contains all the image data, which has to be distributed over self.train_stack and self.test_stack. I tried the type() function, but this seems to be wrong. I also tried self.train_stack = [], but this raises the error "ValueError: array dimensions must agree except for d_0". If I used numpy.zeros, the first row of the stack would be zeros, and I want the stack to be completely empty before vstacking.
What is the right way to initialize an empty numpy array? (type 'numpy.ndarray')
P.S. Note that self.train_stack is set inside a loop, so an if statement that checks whether the variable exists will not reset it when the loop is entered for the second time.
Avoid calling np.vstack in a loop. Each call allocates a new array and copies all the data from the original array, plus the new row, into it. All that copying makes such a solution slower than necessary.
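As a minimal sketch of the point above, and of the literal question about an empty starting array: an array with zero rows and the right number of columns can be vstacked onto, but appending rows to a Python list and stacking once avoids the repeated copying. The name pca_data below is a hypothetical stand-in for self.pca_data, and the shapes are made up for illustration.

```python
import numpy as np

# Hypothetical stand-in for self.pca_data: 6 samples, 3 features each.
pca_data = np.arange(18, dtype=float).reshape(6, 3)

# Option 1: start from an array with zero rows and the right column count.
# np.vstack accepts it, but every call still copies the whole stack.
stack = np.empty((0, pca_data.shape[1]))
for row in pca_data[:2]:
    stack = np.vstack((stack, row))

# Option 2: collect rows in a Python list and stack once at the end.
rows = []
for row in pca_data[:2]:
    rows.append(row)
stacked_once = np.vstack(rows)

print(stack.shape)                          # (2, 3)
print(np.array_equal(stack, stacked_once))  # True
```

Both options give the same result here; the second one only allocates the final array once.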
If we can assume that every row of self.pca_data belongs in either self.train_stack or self.test_stack, then you could replace the entire for-loop
for data in range(len(self.pca_data)):
    ...
with a call to np.in1d to create a boolean mask, and then define self.train_stack and self.test_stack by indexing self.pca_data using the mask:
for fold, (train_ind, test_ind) in enumerate(kf):
    train_pers = unique[train_ind]
    mask = np.in1d(self.pca_pers[:,0], train_pers)
    self.train_stack = self.pca_data[mask]
    self.test_stack = self.pca_data[~mask]
For example, np.in1d creates a boolean array which is True when the element in the first array-like is in the second array-like:
In [544]: np.in1d(range(5), [1,2,4])
Out[544]: array([False, True, True, False, True], dtype=bool)
and boolean indexing can be used to select rows like this:
In [545]: mask = np.in1d(range(5), [1,2,4])
In [546]: x = np.arange(10).reshape(5,-1)
In [547]: x
Out[547]:
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
In [548]: x[mask]
Out[548]:
array([[2, 3],
       [4, 5],
       [8, 9]])
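Putting the two pieces together, here is a small self-contained sketch of the mask-based split. The arrays pca_data and pca_pers and the chosen training persons are invented for illustration, and the person labels are assumed to be a 1-D array (the answer's code indexes a 2-D one with [:,0]). It uses np.isin, which recent NumPy versions recommend in place of np.in1d.

```python
import numpy as np

# Hypothetical data: 6 samples with 2 features, one person label per sample.
pca_data = np.arange(12).reshape(6, 2)
pca_pers = np.array([10, 11, 10, 12, 13, 11])

unique = np.unique(pca_pers)   # sorted distinct person labels
train_pers = unique[[0, 2]]    # pretend the fold selected persons 10 and 12

# True where the sample's person is one of the training persons.
mask = np.isin(pca_pers, train_pers)

train_stack = pca_data[mask]   # rows whose person is in train_pers
test_stack = pca_data[~mask]   # all remaining rows

print(train_stack.shape, test_stack.shape)  # (3, 2) (3, 2)
```

Because the test set is defined as the complement of the mask, this only matches the original loop if every person is in either the training or the test group for the fold.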