I am struggling to convert selected data from a pd.DataFrame to a np.array: instead of a 2-D array I get an array of arrays. I would like to know why I don't immediately get back a normal array. I am aware of to_numpy(), but it does not produce the desired result, and I also cannot replace the NaN values with it. Could you please help me understand what is going on? Thanks a lot! Have a nice day.
My mini example:
import pandas as pd
import numpy as np

# prepare the example
d = {}
d['key1'] = np.array([np.nan, 2, np.nan, 4])
d['key2'] = np.array([5, 6, 7, 8])
d['key3'] = np.array([9, 10, 11, 12])
print(d)
print(type(d))
# create the example df
df = pd.DataFrame(index=[0, 1, 2, 3, 4, 5], columns=['A', 'B'])
for i in df.index:
    df.at[i, 'A'] = d
df
# extract data from selected rows
res1 = df.loc[[1, 2, 3], 'A'].apply(lambda x: x.get('key2')).to_numpy()
print(res1)
print(res1.shape)  # (3,)
# res1 is an object array filled with arrays.
# Why don't I immediately get back an array of shape (3, 4)?
# How can I get a np.array like this:
# res2 = np.array([[5, 6, 7, 8], [5, 6, 7, 8], [5, 6, 7, 8]])
# res2.shape  # (3, 4)
# The solution I found:
res3 = np.stack(res1, axis=0)
print(res3)
print(type(res3))
print(res3.shape)
# Is there something better that immediately yields a np.ndarray of shape (3, 4)?
# How can I replace the NaN values?
res4 = df.loc[[1, 2, 3], 'A'].apply(lambda x: x.get('key1')).to_numpy(na_value=0)
print(res4)  # still nan, not 0
Thank you.
Edit: I clarified the second question. I would just like to replace the NaNs with 0, for example. In my real-world data not every array stored under key1 contains a NaN, and I need to keep the number of elements in each array the same. Sorry. Does somebody understand why my example does not give the desired result? Thank you.
Try:

res4 = df.loc[[1, 2, 3], 'A'].str['key1'].values
# instead of apply() with a lambda, use .str['key name'] to get the value of a particular key
res4 = np.vstack(res4)
# similar to np.stack() with axis=0
# finally, replace the NaNs:
res4 = np.where(pd.isna(res4), 0, res4)

Output of res4:

array([[0., 2., 0., 4.],
       [0., 2., 0., 4.],
       [0., 2., 0., 4.]])
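As a side note, once the data is stacked into a float array, np.nan_to_num offers a shorter spelling of the replacement step (a sketch, not part of the original answer):

```python
import numpy as np

# same shape and contents as the stacked res4 above
res = np.vstack([np.array([np.nan, 2., np.nan, 4.])] * 3)
res = np.nan_to_num(res, nan=0.0)  # replace every NaN with 0 in one call
print(res)
```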
Explanation:
The values you are getting form a Series of NumPy arrays:
df.loc[[1,2,3],'A'].str['key1']
#output:
1 [nan, 2.0, nan, 4.0]
2 [nan, 2.0, nan, 4.0]
3 [nan, 2.0, nan, 4.0]
Name: A, dtype: object
You can check that by mapping type:
df.loc[[1,2,3],'A'].str['key1'].map(type)
#output:
1 <class 'numpy.ndarray'>
2 <class 'numpy.ndarray'>
3 <class 'numpy.ndarray'>
Name: A, dtype: object
# or just by:
df.loc[[1,2,3],'A'].str['key1'].values
#output:
array([array([nan, 2., nan, 4.]), array([nan, 2., nan, 4.]),
array([nan, 2., nan, 4.])], dtype=object)
so you get an array of arrays.
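Because the Series has object dtype, to_numpy() simply hands back those per-row array objects. A sketch of one way to get a regular 2-D array directly, by going through a Python list first (data mirrors the question's setup):

```python
import numpy as np
import pandas as pd

d = {'key1': np.array([np.nan, 2, np.nan, 4]), 'key2': np.array([5, 6, 7, 8])}
s = pd.Series([d, d, d])                # object-dtype Series of dicts, like df['A']
res = np.array(s.str['key2'].tolist())  # a list of equal-length arrays stacks into 2-D
print(res.shape)  # (3, 4)
```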
Note: the na_value parameter of the to_numpy() method doesn't work here because the values inside your Series are stored in a container (a np.array in your case). It also won't work with list, tuple, or set values, because those are containers (data structures) too. If the values are not stored in a container, then na_value=0 will work.
Consider the following example:

s = pd.Series([5, 4, 7, np.nan, np.nan])
# a Series of plain floats
df = pd.DataFrame(data=[[5, 4, np.nan, np.nan, 6], [2, np.nan, 5, np.nan, 7]]).T
# and a DataFrame of plain floats
# here the values are not stored in a container (the dtype is float)
Now I can easily use the na_value parameter of the to_numpy() method:
s.to_numpy(na_value=0)
#output of above code:
array([5., 4., 7., 0., 0.])
df.to_numpy(na_value=0)
#output of above code:
array([[5., 2.],
[4., 0.],
[0., 5.],
[0., 0.],
[6., 7.]])
Update:
As mentioned above, the na_value parameter of to_numpy() doesn't work because the values inside your Series are stored in a container (a np.array in your case): the lookup for 'key1' returns an array of values, and that array is the container holding them.
Consider the following example:
d = {}
d['key1'] = np.array([np.nan, 2, np.nan, 4])
d['key2'] = np.array([5, 6, 7, 8])
d['key3'] = np.array([9, 10, 11, 12])
d1 = {}
d1['key1'] = np.nan
d1['key2'] = np.array([5, 6, 7, 8])
d1['key3'] = np.array([9, 10, 11, 12])
df = pd.DataFrame(index=[0, 1, 2, 3], columns=['A', 'B'])
df.at[0, 'A'] = d
df.at[1, 'A'] = d
df.at[2, 'A'] = d1
df.at[3, 'A'] = d1
Now if you use the na_value parameter you will get:
df['A'].str['key1'].to_numpy(na_value=0)
#output:
array([array([nan, 2., nan, 4.]), array([nan, 2., nan, 4.]), 0, 0],
dtype=object)
# the NaNs inside the arrays are not filled, because they sit inside a container (np.array)
# the bare NaN values are filled with 0
Note: if the Series contains actual dicts, you can use the str['keyname'] notation to get the values of that key across the Series, and of course it is faster than apply() with an anonymous function.
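A minimal check (a sketch with made-up data) that the str['keyname'] spelling and the apply() spelling return the same values:

```python
import numpy as np
import pandas as pd

d = {'key1': np.array([1.0, 2.0]), 'key2': np.array([3.0, 4.0])}
s = pd.Series([d, d, d])                      # object-dtype Series of dicts

via_str = s.str['key1']                       # dict lookup per element
via_apply = s.apply(lambda x: x.get('key1'))  # same lookup via apply()
same = all(np.array_equal(a, b) for a, b in zip(via_str, via_apply))
print(same)  # True
```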
Calling to_numpy() and then np.stack() seems like the right answer; I can't think of a better or shorter way to do it.
I'm assuming you want to replace NaNs with zeroes, not remove the values (and change the shape). The following code does it:
res4 = df.loc[[1,2,3],'A'].apply(lambda x: x.get('key1')).to_numpy()
res4 = np.stack(res4)
np.where(np.isnan(res4), 0, res4)
np.isnan produces a boolean mask of the same shape, and np.where puts 0 wherever the mask is True and otherwise keeps the value from res4.
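The mask-and-replace step in isolation (a small sketch):

```python
import numpy as np

a = np.array([[np.nan, 2.0], [3.0, np.nan]])
mask = np.isnan(a)           # boolean mask, same shape as a
out = np.where(mask, 0, a)   # 0 where the mask is True, a's value elsewhere
print(out)  # [[0. 2.] [3. 0.]]
```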