I have a DataFrame with 1000 rows and 1000 columns. I am trying to build a NumPy array from it with a for loop that randomly selects 5 columns per cycle. I need to append or concatenate the array (1000 rows and 5 columns) generated in each cycle. However, it seems that it is not possible to create a NumPy array without first specifying its dimensions.
I have tried the following code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.choice([0.0, 0.05], size=(1000,1000)))
l = np.array([])
for i in range(0,100):
    rand_cols = np.random.permutation(df.columns)[0:5]
    df2 = df[rand_cols].copy()
    l = np.append(l, df2, axis=0)
However, I get the following error:
ValueError: all the input arrays must have same number of dimensions
This code summarizes what I am doing. For this example, the result I need is an array of 1000 rows and 500 columns, built by concatenating the arrays generated in each cycle of the for loop.
When collecting results in a loop, list append is almost always better than np.append: it is faster, and easier to use correctly.
But let's look at your code in more detail:
In [128]: df = pd.DataFrame(np.random.choice([0.0, 0.05], size=(1000,1000)))
In [129]: l = np.array([])
In [130]: rand_cols = np.random.permutation(df.columns)[0:5]
In [131]: rand_cols
Out[131]: array([190, 106, 618, 557, 514])
In [132]: df2 = df[rand_cols].copy()
In [133]: df2.shape
Out[133]: (1000, 5)
In [134]: l1 = np.append(l, df2, axis=0)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-134-64d82acc3963> in <module>
----> 1 l1 = np.append(l, df2, axis=0)
/usr/local/lib/python3.6/dist-packages/numpy/lib/function_base.py in append(arr, values, axis)
4692 values = ravel(values)
4693 axis = arr.ndim-1
-> 4694 return concatenate((arr, values), axis=axis)
4695
4696
ValueError: all the input arrays must have same number of dimensions
Since you specified the axis, all np.append is doing is:
np.concatenate([l, df2], axis=0)
l has shape (0,) and df2 has shape (1000, 5). One is 1d and the other 2d, hence the complaint about dimensions.
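The mismatch is easy to check directly. Here is a minimal sketch that uses a small stand-in array instead of the DataFrame (the names empty_1d and block are illustrative only, not from the question):

```python
import numpy as np

empty_1d = np.array([])        # shape (0,): one dimension
block = np.zeros((1000, 5))    # stand-in for df2: two dimensions

print(empty_1d.ndim, block.ndim)  # 1 2

try:
    np.concatenate([empty_1d, block], axis=0)
    failed = False
except ValueError as err:
    # concatenate refuses inputs with different numbers of dimensions
    failed = True
    print("ValueError:", err)
```

The exact wording of the message may differ between numpy versions, but the exception type is ValueError.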
Starting with a 2d l array works:
In [144]: l = np.zeros((0,5))
In [145]: np.concatenate([l, df2], axis=0).shape
Out[145]: (1000, 5)
In [146]: np.concatenate([df2, df2], axis=0).shape
Out[146]: (2000, 5)
I think np.append should be deprecated. We see too many SO errors. As your case shows, it is hard to create the correct initial array: np.array([]) only works when building a 1d array. Plus, repeated concatenates are slow, because each one creates a completely new array.
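Since the final shape is known in advance here (1000 rows, 100 cycles of 5 columns), one way to avoid repeated concatenation altogether is to allocate the result once and fill it slice by slice. This is only a sketch under that assumption; the seeded generator and the names n_cycles, n_pick and out are mine, not from the question:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
df = pd.DataFrame(rng.choice([0.0, 0.05], size=(1000, 1000)))

n_cycles, n_pick = 100, 5
# Allocate the full result once; its shape is known before the loop starts.
out = np.empty((df.shape[0], n_cycles * n_pick))

for i in range(n_cycles):
    rand_cols = rng.permutation(df.columns.to_numpy())[:n_pick]
    # Write this cycle's (1000, 5) block into its own column slice.
    out[:, i * n_pick:(i + 1) * n_pick] = df[rand_cols].to_numpy()

print(out.shape)  # (1000, 500)
```

If the number of cycles is not known up front, collecting the blocks in a list and concatenating once at the end is the simpler choice.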
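If I understand correctly (IIUC), you want to collect the blocks in a list and concatenate them once along the column axis:
l = []
for i in range(0, 100):
    rand_cols = np.random.permutation(df.columns)[0:5]
    df2 = df[rand_cols].copy()
    l.append(df2.values)
a = np.concatenate(l, axis=1)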
a.shape
(1000, 500)
The reason you're getting this error is that you're trying to append an array df2 with shape (1000, 5) to an array l with shape (0,) (only one dimension). With numpy, the concatenated arrays must have the same number of dimensions AND every dimension except the one you're appending along must match, i.e. to append along axis 0 you should have initialized l with a shape of (0, 5).
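Here's a version of the code that runs without the error. Note that appending along axis=0 stacks the blocks vertically, so the result has shape (100000, 5); to get the 1000 x 500 array described in the question, initialize l with shape (1000, 0) and append along axis=1 instead:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.choice([0.0, 0.05], size=(1000,1000)))
l = np.empty(shape=(0, 5))
for _ in range(0,100):
    rand_cols = np.random.permutation(df.columns)[0:5]
    df2 = df[rand_cols]
    l = np.append(l, df2, axis=0)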
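The alignment rule can be checked on both axes with a small example. This is a sketch that uses a zero-filled stand-in for df2; the names block, rows and cols are illustrative only:

```python
import numpy as np

block = np.zeros((1000, 5))      # stand-in for one df2

# Growing along axis 0: the column counts (5) must match.
rows = np.empty((0, 5))
rows = np.append(rows, block, axis=0)
rows = np.append(rows, block, axis=0)

# Growing along axis 1: the row counts (1000) must match.
cols = np.empty((1000, 0))
cols = np.append(cols, block, axis=1)
cols = np.append(cols, block, axis=1)

print(rows.shape, cols.shape)  # (2000, 5) (1000, 10)
```

The second form, starting from shape (1000, 0), is the one that grows toward the 1000 x 500 layout the question asks for.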
Now, a best practice is to avoid appending arrays within a loop, as this is not computationally efficient (a new numpy array has to be created at each iteration, which takes time). It is better to append each iteration's result to a standard Python list and wait until the loop has finished to stack all the results together.
Here is the code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.choice([0.0, 0.05], size=(1000,1000)))
df_list = []
for _ in range(0,100):
    rand_cols = np.random.permutation(df.columns)[0:5]
    df2 = df[rand_cols]
    df_list.append(df2)
l = np.vstack(df_list)
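Here I use numpy.vstack to concatenate along the row axis, which gives an array of shape (100000, 5); numpy.concatenate with axis=0 would give the same result. To get the (1000, 500) array described in the question, concatenate along the column axis instead, with numpy.hstack or numpy.concatenate with axis=1. Note that there is no need to convert the pandas DataFrames to numpy arrays first.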
On my computer, this small change reduced the run time from 164 ms to 107 ms (values taken from a single quick run of each version). This is not that significant here, but I think it is good to know :)
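To make the difference between the two stacking directions concrete, here is a short sketch that builds the list once and stacks it both ways. The names blocks, stacked_rows and stacked_cols are my own, and the explicit to_numpy() call is optional:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.choice([0.0, 0.05], size=(1000, 1000)))

blocks = []
for _ in range(100):
    rand_cols = np.random.permutation(df.columns)[:5]
    blocks.append(df[rand_cols].to_numpy())   # each block is (1000, 5)

stacked_rows = np.vstack(blocks)   # blocks placed one under another
stacked_cols = np.hstack(blocks)   # blocks placed side by side

print(stacked_rows.shape)  # (100000, 5)
print(stacked_cols.shape)  # (1000, 500)
```

Only the hstack result matches the 1000 rows by 500 columns requested in the question.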