
numpy vstack empty initialization

I have image data that was stacked with vstack, and now I want to split it into a training set and a test set. How do I initialize an empty numpy array so that I can start vstacking onto it?

My simplified code looks like this:

#k-fold the data
kf = cross_validation.KFold(n, n_folds=2)
fold = 0
for train_ind, test_ind in kf:
    #Get the persons of k-fold
    train_pers = unique[train_ind]
    test_pers = unique[test_ind]

    #Set train+test stack to empty
    self.train_stack = type(self.pca_data[0])
    self.test_stack = type(self.pca_data[0])

    #For all test data
    for data in range(len(self.pca_data)):
        print(self.pca_pers[data])
        if self.pca_pers[data] in train_pers:
            #Add to train stack
            self.train_stack = np.vstack((self.train_stack, self.pca_data[data]))

        elif self.pca_pers[data] in test_pers:
            #Add to test stack
            self.test_stack = np.vstack((self.test_stack, self.pca_data[data]))
        else:
            #Something wrong
            print(data)
            sys.exit("Strange strange data")

    fold += 1

The important code here is:

#Set train+test stack to empty
self.train_stack = type(self.pca_data)
self.test_stack = type(self.pca_data)

and

#Add to train stack
self.train_stack = np.vstack((self.train_stack, self.pca_data[fold][data]))

self.pca_data contains all the image data, which has to be distributed over self.train_stack and self.test_stack. I tried the type() function, but this seems to be wrong. I also tried self.train_stack = [], but this raises the error "ValueError: array dimensions must agree except for d_0". If I used numpy.zeros, the first row of the stack would be zeros, and I want the stack to be completely empty before vstacking.

Question

What is the right way to initialize an empty numpy array (type 'numpy.ndarray')?

P.S. Note that self.train_stack is set inside a loop, so an if statement that checks whether the variable exists will not reset it when the loop is entered for the second time.

Variables

  • self.pca_data: Shape(978, 20) Type(type 'numpy.ndarray')
  • self.pca_pers: Shape(978, 1) Type(type 'numpy.ndarray')
  • self.test_stack and self.train_stack should each have, for example, Shape(489, 20), with the same number of columns as self.pca_data
  • Other variables you can ignore

Avoid calling np.vstack in a loop. Each call allocates a new array and copies all the data from the original array, plus the new row, into it. All that copying makes such a solution slower than necessary.
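
To address the literal question as well: an array with zero rows and the right number of columns can serve as the empty starting point for np.vstack. The following is a minimal sketch, not the asker's code. It assumes 20 columns, as in self.pca_data, and the names n_cols, rows, stack, collected and stacked_once are made up for illustration. It also shows the cheaper pattern of collecting rows in a list and stacking once:

```python
import numpy as np

n_cols = 20  # assumption: matches the 20 columns of self.pca_data

# An array with zero rows and n_cols columns is a valid "empty" start for vstack.
stack = np.empty((0, n_cols))

# Hypothetical sample rows standing in for self.pca_data[data]
rows = [np.full(n_cols, float(i)) for i in range(3)]

for row in rows:
    # Works, but every call copies the whole stack into a new array.
    stack = np.vstack((stack, row))

# Cheaper: collect the rows in a Python list and stack once at the end.
collected = []
for row in rows:
    collected.append(row)
stacked_once = np.vstack(collected)

print(stack.shape)         # shape of the row-by-row result
print(stacked_once.shape)  # shape of the stack-once result
```

Both approaches give the same array here. Note that np.vstack raises a ValueError when given an empty list, so the stack-once variant needs at least one collected row.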

If we can assume that every row of self.pca_data belongs in either self.train_stack or self.test_stack, then you could replace the entire for-loop

for data in range(len(self.pca_data)):
    ...

with a call to np.in1d to create a boolean mask, and then define self.train_stack and self.test_stack by indexing self.pca_data using the mask:

for fold, (train_ind, test_ind) in enumerate(kf):
    train_pers = unique[train_ind]
    mask = np.in1d(self.pca_pers[:,0], train_pers)
    self.train_stack = self.pca_data[mask]
    self.test_stack = self.pca_data[~mask]
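
The mask-based split can be tried in isolation on made-up data. In this sketch the arrays pca_data, pca_pers and train_pers are small synthetic stand-ins, not the asker's data. It uses np.isin, which newer NumPy versions recommend in place of np.in1d (np.in1d is deprecated as of NumPy 2.0):

```python
import numpy as np

# Synthetic stand-ins (assumptions, not the asker's data)
pca_data = np.arange(12, dtype=float).reshape(6, 2)   # 6 samples, 2 features
pca_pers = np.array([[1], [2], [1], [3], [2], [4]])   # person id per sample, shape (6, 1)
train_pers = np.array([1, 3])                         # persons assigned to the training set

# True for every sample whose person id is in train_pers
mask = np.isin(pca_pers[:, 0], train_pers)

train_stack = pca_data[mask]    # rows belonging to training persons
test_stack = pca_data[~mask]    # all remaining rows

print(train_stack.shape, test_stack.shape)
```

Because the test set is taken as the complement of the mask, any row whose person is not in train_pers lands in test_stack. That matches the assumption stated above that every row belongs to one of the two sets.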

For example, np.in1d creates a boolean array which is True wherever an element of the first array-like is also in the second array-like:

In [544]: np.in1d(range(5), [1,2,4])
Out[544]: array([False,  True,  True, False,  True], dtype=bool)

and boolean indexing can be used to select rows like this:

In [545]: mask = np.in1d(range(5), [1,2,4])

In [546]: x = np.arange(10).reshape(5,-1)

In [547]: x
Out[547]: 
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])

In [548]: x[mask]
Out[548]: 
array([[2, 3],
       [4, 5],
       [8, 9]])
