
How to convert pd.series to np.array, not array of arrays? How to replace nan values?

I am struggling to convert selected data from a pandas DataFrame to a NumPy array. Instead of a regular 2-D array I get an array of arrays, and I would like to know why. I am aware of to_numpy(), but it does not produce the desired result, and I cannot replace the NaN values with it either. Could you please help me understand what is going on? Thanks a lot!

My mini example:

import pandas as pd
import numpy as np

#prepare the example
d={}
d['key1']=np.array([np.nan,2,np.nan,4])
d['key2']=np.array([5,6,7,8])                 
d['key3']=np.array([9,10,11,12])       
print(d)
print(type(d))

# create example df
df=pd.DataFrame(index=[0,1,2,3,4,5],columns=['A','B'])
df.at[0,'A'] = d
df.at[1,'A'] = d
df.at[2,'A'] = d
df.at[3,'A'] = d
df.at[4,'A'] = d
df.at[5,'A'] = d

df

# extract data from selected rows
res1=df.loc[[1,2,3],'A'].apply(lambda x: x.get('key2')).to_numpy()
print(res1)
print(res1.shape) #(3,)
#res1 is an object array whose elements are arrays.
#Why don't I get back an array of shape (3,4) right away, please?

#How can I get a np.array like this, please?
#res2=np.array([[5, 6, 7, 8],[5, 6, 7, 8],[5, 6, 7, 8]])
#res2.shape #(3,4)

# The solution I found:
res3=np.stack(res1,axis=0)
print(res3)
print(type(res3))
print(res3.shape)
#Is there something better that results immediately in a np.ndarray with (3,4)?

#How can I replace the nan values, please?
res4=df.loc[[1,2,3],'A'].apply(lambda x: x.get('key1')).to_numpy(na_value=0)
print(res4) #nan not 0

Thank you.

Edit: I clarified the second question. I would just like to replace the NaNs with 0, for example. In the real-world example not all arrays for key1 contain a NaN, and I need to keep the number of elements in each array the same. Does somebody understand why my example does not give the desired result? Thank you.

Try:

res4=df.loc[[1,2,3],'A'].str['key1'].values
#instead of apply() with a lambda, use .str['key name'] to get the value stored under a particular key
res4=np.vstack(res4)
#this is similar to np.stack() with axis=0
#Finally, replace the NaNs with 0:
res4=np.where(pd.isna(res4),0,res4)

Output of res4:

array([[0., 2., 0., 4.],
       [0., 2., 0., 4.],
       [0., 2., 0., 4.]])

Explanation of what is happening:

What you get back is a Series whose elements are NumPy arrays:

df.loc[[1,2,3],'A'].str['key1']
#output:
1    [nan, 2.0, nan, 4.0]
2    [nan, 2.0, nan, 4.0]
3    [nan, 2.0, nan, 4.0]
Name: A, dtype: object

You can check that by mapping type:

df.loc[[1,2,3],'A'].str['key1'].map(type)
#output:
1    <class 'numpy.ndarray'>
2    <class 'numpy.ndarray'>
3    <class 'numpy.ndarray'>
Name: A, dtype: object

#OR  
#just by:

df.loc[[1,2,3],'A'].str['key1'].values

#output:
array([array([nan,  2., nan,  4.]), array([nan,  2., nan,  4.]),
       array([nan,  2., nan,  4.])], dtype=object)

So you get an object array whose elements are themselves arrays, not a 2-D array.
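To make the shapes concrete, here is a minimal sketch (the names cells, as_objects and stacked are made up for illustration). A Series is always one-dimensional, so each cell stores a whole array as a single Python object. to_numpy() only hands those objects back, and it takes an explicit stacking step to get a 2-D array.

```python
import numpy as np
import pandas as pd

d = {'key2': np.array([5, 6, 7, 8])}

# Series of dicts, one dict per row, as in the question.
cells = pd.Series([d, d, d]).apply(lambda x: x.get('key2'))

as_objects = cells.to_numpy()
print(as_objects.dtype)   # object: each element is itself an ndarray
print(as_objects.shape)   # (3,)

# Stacking the three length-4 arrays gives a regular 2-D array.
stacked = np.stack(as_objects)
print(stacked.shape)      # (3, 4)
```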

Note: the na_value parameter of to_numpy() has no effect here because the NaNs are not elements of the Series itself. They sit inside the containers (np.array objects in your case) that the Series holds.

It will not work for list, tuple or set values either, because they are also containers (data structures).

If the values are not stored in a container, na_value=0 will work.

Consider the following example:

s=pd.Series([5,4,7,np.nan,np.nan])
#Let's say I have this Series
df=pd.DataFrame(data=[[5,4,np.nan,np.nan,6],[2,np.nan,5,np.nan,7]]).T
#And this dataframe
#So the values inside Series and Dataframe are not stored in a container(the datatype is float)

Now I can easily use na_value parameter of to_numpy() method:

s.to_numpy(na_value=0)
#output of above code:
array([5., 4., 7., 0., 0.])
df.to_numpy(na_value=0)
#output of above code:
array([[5., 2.],
       [4., 0.],
       [0., 5.],
       [0., 0.],
       [6., 7.]])

Update:

As mentioned above, the na_value parameter of to_numpy() has no effect because the values inside your Series are stored in containers (np.array in your case).

Looking up 'key1' in each dict returns an array (a container holding the values), not a scalar.

Consider the following example:

d={}
d['key1']=np.array([np.nan,2,np.nan,4])
d['key2']=np.array([5,6,7,8])                 
d['key3']=np.array([9,10,11,12])       
d1={}
d1['key1']=np.nan
d1['key2']=np.array([5,6,7,8])                 
d1['key3']=np.array([9,10,11,12]) 
df=pd.DataFrame(index=[0,1,2,3],columns=['A','B'])
df.at[0,'A'] = d
df.at[1,'A'] = d
df.at[2,'A'] = d1
df.at[3,'A'] = d1

Now if you use na_value parameter you will get:

df['A'].str['key1'].to_numpy(na_value=0)
#output:
array([array([nan,  2., nan,  4.]), array([nan,  2., nan,  4.]), 0, 0],
      dtype=object)
#The NaNs in the first two elements are not filled because they are inside a container (np.array).
#The scalar NaNs in the last two elements are filled with 0.

Note:

If the Series contains real dicts, you can use the .str['keyname'] notation to get the value stored under that key for every row. It is shorter than apply() with an anonymous function and should also be faster.
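As a quick check that the two spellings are interchangeable, here is a small sketch (the names s, via_str and via_apply are invented for this example, and it does not measure speed):

```python
import numpy as np
import pandas as pd

d = {'key1': np.array([np.nan, 2, np.nan, 4]), 'key2': np.array([5, 6, 7, 8])}
s = pd.Series([d, d, d])

# Look up 'key2' in every dict, once via .str[...] and once via apply().
via_str = np.vstack(s.str['key2'].to_numpy())
via_apply = np.vstack(s.apply(lambda x: x.get('key2')).to_numpy())

print(np.array_equal(via_str, via_apply))  # True
```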

First question

Calling to_numpy() and then np.stack() seems like the right answer. I can't think of a better or shorter way to do it.
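For what it's worth, here are two equivalent spellings as a sketch (col, res_stack and res_array are illustrative names). They are not necessarily better, they just skip the explicit to_numpy() call. Both assume every inner array has the same length.

```python
import numpy as np
import pandas as pd

d = {'key2': np.array([5, 6, 7, 8])}
col = pd.Series([d, d, d]).apply(lambda x: x.get('key2'))

# Option 1: stack the list of cells directly.
res_stack = np.stack(col.tolist())

# Option 2: let np.array build the 2-D array from the list of rows.
res_array = np.array(col.tolist())

print(res_stack.shape, res_array.shape)  # (3, 4) (3, 4)

# np.stack rejects inputs of different lengths instead of guessing.
try:
    np.stack([np.array([1, 2, 3]), np.array([1, 2])])
except ValueError as exc:
    print("ragged input rejected:", exc)
```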

Second question

I'm assuming you want to replace NaNs with zeroes, not remove the values (and change the shape). The following code does it:

res4 = df.loc[[1,2,3],'A'].apply(lambda x: x.get('key1')).to_numpy()
res4 = np.stack(res4)
res4 = np.where(np.isnan(res4), 0, res4)

np.isnan produces a boolean mask with identical shape, and np.where puts 0 where it finds True, otherwise it keeps the value from res4 .
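An alternative sketch, assuming the data is already stacked into a float array: np.nan_to_num with nan=0.0 does the same replacement in one call (the variable names below are illustrative). The second half shows why the stacking step has to come first, since np.isnan cannot be applied to an object array.

```python
import numpy as np

stacked = np.array([[np.nan, 2.0, np.nan, 4.0]] * 3)

# np.nan_to_num replaces NaN with the value given by `nan=`.
# By default it also maps +/-inf to very large finite numbers.
filled = np.nan_to_num(stacked, nan=0.0)
print(filled)

# Build the unstacked form: a 1-D object array holding three float arrays.
objects = np.empty(3, dtype=object)
for i in range(3):
    objects[i] = np.array([np.nan, 2.0, np.nan, 4.0])

try:
    np.isnan(objects)
except TypeError:
    print("np.isnan does not support object arrays; stack first")
```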


 