I'm trying to create a NumPy array for the "label" column from a pandas data-frame.
My df:
label vector
0 0 1:0.044509422 2:-0.03092437 3:0.054365806 4:-...
1 0 1:-0.007471546 2:-0.062329583 3:0.012314787 4...
2 0 1:-0.009525825 2:0.0028720177 3:0.0029517233 ...
3 1 1:-0.0040618754 2:-0.03754585 3:0.008025528 4...
4 0 1:0.039150625 2:-0.08689039 3:0.09603256 4:0....
... ... ...
59996 1 1:0.01846487 2:-0.012882819 3:0.035375785 4:-...
59997 1 1:0.01435293 2:-0.00683616 3:0.009475072 4:-0...
59998 1 1:0.018322088 2:-0.017116712 3:0.013021051 4:...
59999 0 1:0.014471473 2:-0.023652712 3:0.031210974 4:...
60000 1 1:0.00888336 2:-0.006902163 3:0.022569133 4:0...
As you can see, it has two columns: label and vector. For the label column I'm using this solution:
y = pd.DataFrame([df.label])
print(y.astype(float).to_numpy())
print(y)
As a result I get this:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ... 59985 59986 59987 59988 59989 59990 59991 59992 59993 59994 59995 59996 59997 59998 59999 60000
label 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 1 ... 1 1 1 0 1 0 0 1 1 1 1 1 1 1 0 1
[1 rows x 60001 columns]
However, the expected output should be:
0
0 0
1 0
2 0
3 1
... ...
[60001 rows x 1 columns]
Instead of an array with [1 rows x 60001 columns], I would like to have an array with [60001 rows x 1 columns].
Thanks for your time
"Instead of an array with [1 rows x 60001 columns] I would like to have an array with [60001 rows x 1 columns]"
If I understand your issue correctly, you need to reshape your array. Once y is a NumPy array (the result of to_numpy(); a DataFrame has no reshape method), use:
y = y.reshape(-1, 1)
This converts your array to a shape with one column and works out the number of rows for you (the dimension given as -1 is calculated automatically from the array's size and the other dimensions). So you can do either of these:
Your proposed way + reshape:
y = pd.DataFrame([df.label]).astype(float).to_numpy().reshape(-1, 1)
Or @cs95's suggested answer (which results in the same array):
y = df[['label']].astype(float).to_numpy()
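To check that the two routes really agree, here is a minimal sketch on a small made-up frame. The five rows and the shortened vector strings are hypothetical stand-ins for the real data, and the names y_reshaped and y_selected are only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's frame; only the "label" column matters here.
df = pd.DataFrame({
    "label": [0, 0, 0, 1, 0],
    "vector": ["1:0.04", "1:-0.01", "1:-0.01", "1:-0.00", "1:0.04"],
})

# Route 1: build the 1 x n frame as in the question, then reshape to n x 1.
y_reshaped = pd.DataFrame([df.label]).astype(float).to_numpy().reshape(-1, 1)

# Route 2: select the column with a list, which keeps it 2d from the start.
y_selected = df[["label"]].astype(float).to_numpy()

print(y_reshaped.shape)
print(y_selected.shape)
print(np.array_equal(y_reshaped, y_selected))
```

Both arrays should come out with one column and as many rows as the frame has. The second route avoids the detour through a 1 x n frame.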
If you start with a dataframe
In [98]: df
Out[98]:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
and select a column by name, you get a Series:
In [99]: df.a # df['a']
Out[99]:
0 0
1 4
2 8
Name: a, dtype: int64
In [100]: type(_)
Out[100]: pandas.core.series.Series
The to_numpy of the Series is a 1d array:
In [101]: df.a.to_numpy()
Out[101]: array([0, 4, 8])
In [102]: _.shape
Out[102]: (3,)
But you've taken the Series, and turned it back into a dataframe:
In [103]: y = pd.DataFrame([df.a])
In [104]: y
Out[104]:
0 1 2
a 0 4 8
Was that your intention? In any case, the extracted array is 2d:
In [105]: y.to_numpy()
Out[105]: array([[0, 4, 8]])
In [106]: _.shape
Out[106]: (1, 3)
We can reshape it, or take its 'transpose':
In [107]: __.T # reshape(3,1)
Out[107]:
array([[0],
[4],
[8]])
If we omit the [] from the y expression, we get a different dataframe and the desired 'column' array:
In [109]: pd.DataFrame(df.a)
Out[109]:
a
0 0
1 4
2 8
In [110]: pd.DataFrame(df.a).to_numpy()
Out[110]:
array([[0],
[4],
[8]])
Another option is to select the column with a list:
In [111]: df[['a']]
Out[111]:
a
0 0
1 4
2 8
A Series is the pandas version of a 1d numpy array. It has row indices, but no column ones. A DataFrame is 2d, with rows and columns.
Keep in mind that a numpy array can have shapes (3,), (1,3) and (3,1), all with the same 3 elements.
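A short sketch of those three shapes, using the same values as column a above. The variable names row and col are only for illustration.

```python
import numpy as np

a = np.array([0, 4, 8])   # 1d, shape (3,)
row = a.reshape(1, -1)    # 2d, shape (1, 3)
col = a.reshape(-1, 1)    # 2d, shape (3, 1)

# Same three elements each time; only the shape differs.
for name, arr in [("a", a), ("row", row), ("col", col)]:
    print(name, arr.shape, arr.size)

# Transposing swaps (1, 3) and (3, 1); transposing the 1d array changes nothing.
print(np.array_equal(row.T, col))
print(a.T.shape)
```

This is why .T works on the (1, 3) array taken from the dataframe, but would not turn a 1d Series array into a column. For that, reshape(-1, 1) is needed.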