Append array in for loop

Question

I have a dataframe with 1000 rows and 1000 columns. I am trying to generate a numpy array from that dataframe using a for loop that randomly selects 5 columns per cycle. I need to append or concatenate each array (1000 rows and 5 columns) generated per cycle. However, it seems that it is not possible to create a numpy array without first specifying its dimensions.

I have tried the following code:

import numpy as np
import pandas as pd


df = pd.DataFrame(np.random.choice([0.0, 0.05], size=(1000,1000)))

l = np.array([])

for i in range(0, 100):
    rand_cols = np.random.permutation(df.columns)[0:5]
    df2 = df[rand_cols].copy()
    l = np.append(l, df2, axis=0)

However, I get the following error:

ValueError: all the input arrays must have same number of dimensions

This code summarizes what I am doing. In this example, the outcome I need is an array of 1000 rows and 500 columns, built by concatenating the arrays generated in each cycle of the for loop.

Answer 1

When building an array in a loop, list append is better than np.append. It is faster, and easier to use correctly.

But let's look at your code in more detail:

In [128]: df = pd.DataFrame(np.random.choice([0.0, 0.05], size=(1000,1000)))    
In [129]: l = np.array([])                                                      
In [130]: rand_cols = np.random.permutation(df.columns)[0:5]                    
In [131]: rand_cols                                                             
Out[131]: array([190, 106, 618, 557, 514])
In [132]: df2 = df[rand_cols].copy()                                            
In [133]: df2.shape                                                             
Out[133]: (1000, 5)
In [134]: l1 = np.append(l, df2, axis=0)                                        
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-134-64d82acc3963> in <module>
----> 1 l1 = np.append(l, df2, axis=0)

/usr/local/lib/python3.6/dist-packages/numpy/lib/function_base.py in append(arr, values, axis)
   4692         values = ravel(values)
   4693         axis = arr.ndim-1
-> 4694     return concatenate((arr, values), axis=axis)
   4695 
   4696 

ValueError: all the input arrays must have same number of dimensions

Since you specified the axis, all np.append is doing is:

np.concatenate([l, df2], axis=0)

l has shape (0,) and df2 has shape (1000, 5). One is 1d and the other is 2d, hence the complaint about dimensions.
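The mismatch can be reproduced without pandas. This is a minimal sketch in which block_2d is a hypothetical stand-in for df2, and the names are chosen for illustration only:

```python
import numpy as np

empty_1d = np.array([])          # what the question starts from
block_2d = np.zeros((1000, 5))   # stand-in for df2 (assumed shape from the question)

print(empty_1d.shape, empty_1d.ndim)
print(block_2d.shape, block_2d.ndim)

# Concatenating a 1d array with a 2d array is rejected,
# whatever axis is requested.
try:
    np.concatenate([empty_1d, block_2d], axis=0)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(mismatch_raised)
```

The exact wording of the error message depends on the numpy version, so the sketch only checks that a ValueError is raised.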

Starting with a 2d l array works:

In [144]: l = np.zeros((0,5))                                                   
In [145]: np.concatenate([l, df2], axis=0).shape                                
Out[145]: (1000, 5)
In [146]: np.concatenate([df2, df2], axis=0).shape                              
Out[146]: (2000, 5)

I think np.append should be deprecated. We see too many SO questions about errors like this. As your case shows, it is hard to create the correct initial array: np.array([]) only works when building a 1d array. Plus, repeated concatenates are slow, since each one creates a completely new array.
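To illustrate why np.array([]) is only a safe starting point for 1d results, here is a small sketch (the array block and the loop values are made up for the example). Without an axis argument, np.append flattens both inputs, so the 2d structure is silently lost rather than raising an error:

```python
import numpy as np

block = np.ones((2, 3))

# No axis given: both inputs are flattened before joining.
flat = np.append(np.array([]), block)

# Building a 1d array from np.array([]) does work,
# although each call still copies the whole array.
acc = np.array([])
for value in (1.0, 2.0, 3.0):
    acc = np.append(acc, value)

# The list-based alternative: collect first, concatenate once.
rows = []
for _ in range(4):
    rows.append(block)
stacked = np.concatenate(rows, axis=0)

print(flat.shape, acc.shape, stacked.shape)
```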

Answer 2

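If I understand correctly, you can collect the arrays in a list and concatenate them once, along the column axis:

l = []

for i in range(0, 100):
    rand_cols = np.random.permutation(df.columns)[0:5]
    df2 = df[rand_cols].copy()
    l.append(df2.values)  # collect each (1000, 5) block as a plain array

a = np.concatenate(l, axis=1)  # join the 100 blocks side by side
a.shape
(1000, 500)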

Answer 3

Proposed solution

The reason you're getting this error is that you're trying to append an array df2 with shape (1000, 5) to an array l with shape (0,) (only one dimension). With numpy, the concatenated arrays must have the same number of dimensions, and all dimensions except the one you're appending along must match. In other words, you should have initialized l with a shape of (0, 5).
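As a rough illustration of that rule, the sketch below uses a zero array as a hypothetical stand-in for df2 and tries two empty starting arrays, one for growing along the rows and one for growing along the columns:

```python
import numpy as np

block = np.zeros((1000, 5))        # stand-in for df2

rows_start = np.empty((0, 5))      # matches block on axis 1, grows on axis 0
cols_start = np.empty((1000, 0))   # matches block on axis 0, grows on axis 1

grown_rows = np.concatenate([rows_start, block], axis=0)
grown_cols = np.concatenate([cols_start, block], axis=1)

# Using the wrong starting shape for the chosen axis fails,
# because the row counts (0 and 1000) do not match.
try:
    np.concatenate([rows_start, block], axis=1)
    wrong_axis_raised = False
except ValueError:
    wrong_axis_raised = True

print(grown_rows.shape, grown_cols.shape, wrong_axis_raised)
```

Which starting shape is appropriate depends on the axis you intend to grow along.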

Here's a version of the code that runs without the error (note that appending along axis 0 produces an array of shape (100000, 5)):

import numpy as np
import pandas as pd


df = pd.DataFrame(np.random.choice([0.0, 0.05], size=(1000,1000)))

l =  np.empty(shape=(0, 5))

for _ in range(0,100):
    rand_cols = np.random.permutation(df.columns)[0:5]
    df2 = df[rand_cols]
    l = np.append(l, df2, axis=0)

Suggested improvement

Now, a best practice is to avoid appending arrays within a loop, as this is not computationally efficient: a new numpy array has to be created at each iteration, which takes time. It is better to append each iteration's result to a standard Python list and wait until the loop has finished to stack all the results together.

Here is the code:

import numpy as np
import pandas as pd


df = pd.DataFrame(np.random.choice([0.0, 0.05], size=(1000,1000)))

df_list = []

for _ in range(0,100):
    rand_cols = np.random.permutation(df.columns)[0:5]
    df2 = df[rand_cols]
    df_list.append(df2)
l = np.vstack(df_list)

Here I use numpy.vstack to concatenate along the row axis. Other numpy functions with appropriate parameters would give you the same result. Note that there is no need to convert pandas dataframes to numpy arrays.

On my computer, this little improvement reduced the computational time from 164 ms to 107 ms (values picked from a quick execution of each version). Sure this is not that significant here, but I think this is good to know :)
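Since the question asks for an array of 1000 rows and 500 columns, stacking along the column axis may be closer to what is needed. The following is a sketch under that assumption, with np.hstack swapped in for np.vstack. The names blocks, stacked_rows and stacked_cols are illustrative, and the blocks are converted with to_numpy() here only to make the shapes explicit:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.choice([0.0, 0.05], size=(1000, 1000)))

blocks = []
for _ in range(100):
    rand_cols = np.random.permutation(df.columns)[:5]
    blocks.append(df[rand_cols].to_numpy())  # one (1000, 5) block per cycle

stacked_rows = np.vstack(blocks)  # blocks placed one below the other
stacked_cols = np.hstack(blocks)  # blocks placed side by side

print(stacked_rows.shape)
print(stacked_cols.shape)
```

With 100 cycles of 5 columns each, the row-wise stack has 100000 rows and 5 columns, while the column-wise stack has the 1000 rows and 500 columns described in the question.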

3 answers

solution1
4 ACCEPTED 2019-04-07 17:32:38

solution2
1 2019-04-07 17:11:57

solution3
0 2019-04-07 17:42:57
