
Pandas apply based on column dtypes

I have a sample DataFrame to which I want to apply a different aggregation depending on each column's dtype:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, size=(6, 2)), columns=["A", "B"])
df.loc[2, "B"] = np.nan  # np.nan rather than np.NaN, which was removed in NumPy 2.0
df["C"] = np.nan
df["st"] = ["Mango"] * 6
df["date"] = ["2001-01-01", "2001-01-02", "2001-01-03", "2001-01-04", "2001-01-05", "2001-01-06"]
df["date"] = pd.to_datetime(df["date"])
df

Sample dataframe:

    A    B   C  st        date
0   1   1.0 NaN Mango   2001-01-01
1   4   3.0 NaN Mango   2001-01-02
2   8   NaN NaN Mango   2001-01-03
3   2   1.0 NaN Mango   2001-01-04
4   9   6.0 NaN Mango   2001-01-05
5   9   6.0 NaN Mango   2001-01-06

I'm trying to collapse the DataFrame into a single row, choosing the aggregation by column dtype.

pseudocode:

if data_type(column) == String:
   #first value in the column
   return column_value[0]

if data_type(column) == datetime:
   #last value in the column
   return column_value[-1]

if data_type(column) == int or data_type(column) == float:
   if all_values_in_column==np.NaN:
      return np.NaN
   else:
      #mean of the column
      return mean(column)

Code:

from pandas.api.types import is_datetime64_any_dtype as is_datetime
from pandas.api.types import is_float_dtype, is_integer_dtype, is_string_dtype

def check(series):
    if is_string_dtype(series):
        # first value in the column
        return series[0]
    elif is_datetime(series):
        # last value in the column
        return series[len(series) - 1]
    elif is_integer_dtype(series) or is_float_dtype(series):
        if series.isnull().all():
            return np.nan
        else:
            # mean of the column, counting NaN as 0
            return series.fillna(0).mean()

op = pd.DataFrame(df.apply(check)).transpose()

Current output:

    A   B    C   st         date
0   1   1   NaN Mango   2001-01-01 00:00:00

Only columns C and st are correct. Columns A, B and date come back as their first value instead of the mean (A, B) or the last value (date).

Expected output:

    A     B      C   st       date
0   5.5 2.833   NaN Mango   2001-01-06 00:00:00

Any suggestions on what I am doing wrong would be helpful.

According to "Why does apply change dtype in pandas dataframe columns", you need to pass result_type='expand' to apply:

def check(series):
    if is_string_dtype(series):
        return series[0]
    elif is_datetime(series):
        return series[len(series) - 1]
    elif is_integer_dtype(series) or is_float_dtype(series):
        if series.isnull().all():
            return np.nan
        else:
            return series.fillna(0).mean()

op = pd.DataFrame(df.apply(check, result_type='expand')).transpose()
op


A simple solution would be to loop over all columns and save the results in a dictionary, then create a new dataframe. It can be done as follows:

import numpy as np
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype as is_datetime
from pandas.api.types import is_float_dtype, is_integer_dtype

res = dict()
for col, dtype in df.dtypes.items():
    if is_float_dtype(dtype) or is_integer_dtype(dtype):
        if df[col].isnull().all():
            # an all-NaN column stays NaN
            res[col] = np.nan
        else:
            # mean of the column, counting NaN as 0
            res[col] = df[col].fillna(0).mean()
    elif dtype == object:
        # first value of a string column
        res[col] = df[col].iloc[0]
    elif is_datetime(dtype):
        # last value of a datetime column
        res[col] = df[col].iloc[-1]

op = pd.DataFrame(res, index=[0])

Result:

      A        B      C     st        date
0   5.5 2.833333    NaN  Mango  2001-01-06
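
The loop above can also be wrapped in a reusable function. The following is only a sketch: the name summarize_by_dtype is hypothetical, the fixed values are copied from the sample table in the question so the result is reproducible, and any column that is neither numeric nor datetime is assumed to be a string column.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_float_dtype, is_integer_dtype


def summarize_by_dtype(frame):
    """Collapse a DataFrame to one row, picking the aggregate by column dtype."""
    row = {}
    # DataFrame.items() yields each column as a Series with its own dtype
    for name, col in frame.items():
        if is_datetime64_any_dtype(col):
            row[name] = col.iloc[-1]  # last value of a datetime column
        elif is_integer_dtype(col) or is_float_dtype(col):
            # an all-NaN column stays NaN; otherwise NaN counts as 0 in the mean
            row[name] = np.nan if col.isna().all() else col.fillna(0).mean()
        else:
            # assumption: everything else is treated as a string column
            row[name] = col.iloc[0]
    return pd.DataFrame(row, index=[0])


# Fixed data taken from the sample table in the question
df = pd.DataFrame({
    "A": [1, 4, 8, 2, 9, 9],
    "B": [1.0, 3.0, np.nan, 1.0, 6.0, 6.0],
    "C": [np.nan] * 6,
    "st": ["Mango"] * 6,
    "date": pd.to_datetime(["2001-01-01", "2001-01-02", "2001-01-03",
                            "2001-01-04", "2001-01-05", "2001-01-06"]),
})

op = summarize_by_dtype(df)
print(op)
```

Because the function reads columns through items() rather than apply, it does not depend on how apply passes columns to the callback.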

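Refer to the df.apply documentation.

The problem comes from df.apply: on this mixed-dtype frame it passes each column to the function as a pandas Series of dtype object, which is why the string branch fires for every column in the output above. This may differ between pandas versions.

Try this: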

def check(series):
    print(series.dtype)
    return 0

You'll get:

>>object
>>object
>>object
>>object
>>object

Therefore, instead of using

op = pd.DataFrame(df.apply(check)).transpose()

use

op = pd.DataFrame(df.apply(check, result_type='expand')).transpose()
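
To check which dtype the callback actually receives on your own pandas version, you can record it inside apply and compare it with the dtype seen when reading the columns directly. This is a small diagnostic sketch; the names seen, direct and record_dtype are hypothetical, and the three-column frame is a reduced stand-in for the one in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 4],
    "st": ["Mango", "Mango"],
    "date": pd.to_datetime(["2001-01-01", "2001-01-02"]),
})

seen = {}  # dtype of each column as seen from inside apply


def record_dtype(series):
    seen[series.name] = series.dtype
    return 0


df.apply(record_dtype)

# dtype of each column when read directly from the frame
direct = {name: col.dtype for name, col in df.items()}

for name in df.columns:
    print(name, "| inside apply:", seen[name], "| via df.items():", direct[name])
```

If the two dtypes differ for a column, the dtype checks inside check are testing something other than the original column dtype.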
