Apply function to all rows in pandas dataframe (lambda)

I have the following function, which returns the column name of the last non-zero value in a row:

import pandas as pd

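def myfunc(X, Y):
    row = X.iloc[Y]  # select row Y by position
    counter = len(row) - 1
    # walk backwards from the last column until a non-zero value is found
    while counter >= 0:
        if row.iloc[counter] == 0:  # .iloc makes the positional lookup explicit
            counter -= 1
        else:
            break
    return X.columns[counter]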

With the following example data:

data = {'id':  ['1', '2', '3', '4', '5', '6'],
        'name': ['AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'GGG'],
        'A1': [1, 1, 1, 0, 1, 1],
        'B1': [0, 0, 1, 0, 0, 1],
        'C1': [1, 0, 1, 1, 0, 0],
        'A2': [1, 0, 1, 0, 1, 0]}

df = pd.DataFrame(data)
df

myfunc(df, 5) # 'B1'

I would like to know how I can apply this function to all rows of a dataframe and put the results into a new column of df.

I am thinking about looping over all rows (which is probably not a good approach) or using a lambda with the apply function. However, I have not succeeded with the latter approach. Any help?

I've modified your function a little bit to work across rows:

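def myfunc(row):
    counter = len(row) - 1
    # walk backwards from the last column until a non-zero value is found
    while counter >= 0:
        if row.iloc[counter] == 0:  # .iloc makes the positional lookup explicit
            counter -= 1
        else:
            break
    return row.index[counter]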

Now pass your function to df.apply with axis=1 to call it once for each row of the dataframe:

>>> df.apply(myfunc, axis=1)
0    A2
1    A1
2    A2
3    C1
4    A2
5    B1
dtype: object
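To put the result into a new column, as the question asks, the Series returned by apply can be assigned directly. This is a minimal sketch, not the answer's exact code: the column name last_nonzero, the list value_cols and the function name last_nonzero_label are hypothetical names chosen for illustration, and restricting the call to the value columns is an assumption so that id, name and the new column itself are never scanned.

```python
import pandas as pd

data = {'id':  ['1', '2', '3', '4', '5', '6'],
        'name': ['AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'GGG'],
        'A1': [1, 1, 1, 0, 1, 1],
        'B1': [0, 0, 1, 0, 0, 1],
        'C1': [1, 0, 1, 1, 0, 0],
        'A2': [1, 0, 1, 0, 1, 0]}
df = pd.DataFrame(data)

def last_nonzero_label(row):
    # Walk the row from right to left and stop at the first non-zero value.
    counter = len(row) - 1
    while counter >= 0 and row.iloc[counter] == 0:
        counter -= 1
    # Return None when the whole row is zero (assumption: no such row is expected).
    return row.index[counter] if counter >= 0 else None

# Hypothetical names: the columns to scan and the new column to create.
value_cols = ['A1', 'B1', 'C1', 'A2']
df['last_nonzero'] = df[value_cols].apply(last_nonzero_label, axis=1)
print(df)
```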

However, you can ditch your custom function and use this code to do what you're looking for in a much faster and more concise manner:

>>> df[df.columns[2:]].T.cumsum().idxmax()
0    A2
1    A1
2    A2
3    C1
4    A2
5    B1
dtype: object
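The one-liner works because the running total along a row stops growing after the last non-zero value, and idxmax returns the first label at which the maximum is reached. Below is a sketch of the intermediate step on the question's data. It uses cumsum(axis=1) in place of the transpose, which should give the same labels, and it assumes the entries are never negative. The names values and running are introduced here for illustration only.

```python
import pandas as pd

data = {'id':  ['1', '2', '3', '4', '5', '6'],
        'name': ['AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'GGG'],
        'A1': [1, 1, 1, 0, 1, 1],
        'B1': [0, 0, 1, 0, 0, 1],
        'C1': [1, 0, 1, 1, 0, 0],
        'A2': [1, 0, 1, 0, 1, 0]}
df = pd.DataFrame(data)

# Keep only the value columns (everything after 'id' and 'name').
values = df[df.columns[2:]]

# Running total from left to right within each row.
# Row 0 is 1, 0, 1, 1, so its running total is 1, 1, 2, 3.
running = values.cumsum(axis=1)
print(running)

# The first column at which the running total hits its maximum
# is the column of the last non-zero value.
result = running.idxmax(axis=1)
print(result)
```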

Here is a simpler and faster solution using DataFrame.idxmax.

>>> res = df.iloc[:, :1:-1].idxmax(axis=1)
>>> res

0    A2
1    A1
2    A2
3    C1
4    A2
5    B1
dtype: object

The idea is to select only the value columns (A1, B1, C1, A2), reverse their order ( df.iloc[:, :1:-1] ), and then return, for each row, the column label of the first occurrence of the maximum, which is 1 in this case ( .idxmax(axis=1) ). The first maximum in the reversed order is the last non-zero value in the original order.
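To make the slicing step concrete, here is a small sketch on the question's data that shows what the reversed selection contains before idxmax is applied. The variable name reversed_values is introduced here for illustration.

```python
import pandas as pd

data = {'id':  ['1', '2', '3', '4', '5', '6'],
        'name': ['AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'GGG'],
        'A1': [1, 1, 1, 0, 1, 1],
        'B1': [0, 0, 1, 0, 0, 1],
        'C1': [1, 0, 1, 1, 0, 0],
        'A2': [1, 0, 1, 0, 1, 0]}
df = pd.DataFrame(data)

# The slice :1:-1 starts at the last column and steps backwards,
# stopping before position 1 ('name'), so 'id' and 'name' are dropped.
reversed_values = df.iloc[:, :1:-1]
print(list(reversed_values.columns))  # ['A2', 'C1', 'B1', 'A1']

# idxmax returns the first label holding the row maximum; in reversed
# order that is the right-most 1 of the original row.
res = reversed_values.idxmax(axis=1)
print(res)
```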

Note that this solution (like the other answer) assumes that each row contains at least one entry greater than zero. Because idxmax looks for the maximum, it also assumes that the last non-zero entry of a row is that row's maximum, which always holds for 0/1 data like this.

These assumptions can be relaxed to 'each row contains at least one non-zero entry' if we first mask the non-zero entries (using .ne(0)). This works because .ne(0) produces a boolean mask and True > False <=> 1 > 0.

>>> res = df.iloc[:, :1:-1].ne(0).idxmax(axis=1)
>>> res

0    A2
1    A1
2    A2
3    C1
4    A2
5    B1
dtype: object
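If a row may contain no non-zero entry at all, idxmax still returns a label for it, so that case has to be handled separately. One possible sketch, not part of the original answer: keep the idxmax result only where the mask has at least one True, which leaves NaN for all-zero rows. The small frame below is hypothetical data made up to show an all-zero row and a negative entry.

```python
import pandas as pd

# Hypothetical data: row 1 is all zeros, row 2 has a negative entry.
df = pd.DataFrame({'A1': [1, 0, 0],
                   'B1': [0, 0, -2],
                   'C1': [3, 0, 0]})

# Boolean mask of the non-zero entries, with the column order reversed.
mask = df.iloc[:, ::-1].ne(0)

# Keep the label only for rows that have at least one non-zero entry;
# the other rows become NaN.
res = mask.idxmax(axis=1).where(mask.any(axis=1))
print(res)
```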
