
Setting the values of a pandas df column based on ranges of values of another df column

I have a df that looks like this:

df = pd.DataFrame({'a':[-3,-2,-1,0,1,2,3], 'b': [1,2,3,4,5,6,7]})

I would like to create a column 'c' that looks at the values of 'a' to determine which operation to apply to 'b', and stores the result.

I have a solution that uses iterrows; however, my real df is large and iterrows is inefficient.

What I would like is to do this operation in vectorized form. My 'slow' solution is:

df['c'] = 0.0
for index, row in df.iterrows():
    # iterrows yields copies of each row, so write back through df.at
    if row['a'] <= -2:
        df.at[index, 'c'] = row['b'] * np.sqrt(row['b'] * row['a'])
    if row['a'] > -2 and row['a'] < 2:
        df.at[index, 'c'] = np.log(row['b'])
    if row['a'] >= 2:
        df.at[index, 'c'] = row['b']**3

Use np.select. It's a vectorized operation.

conditions = [
    df['a'] <= -2,
    (df['a'] > -2) & (df['a'] < 2),
    df['a'] >= 2
]

values = [
    df['b'] * np.sqrt(df['b'] * df['a']),
    np.log(df['b']),
    df['b']**3
]

df['c'] = np.select(conditions, values, default=0)
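One detail worth knowing: every entry in the values list is computed for all rows before np.select picks among them, so the square root of the negative product b * a still gets evaluated and produces NaN (with a RuntimeWarning) for the rows where a <= -2. The sketch below reruns the approach on the question's data and, as an optional assumption about what you want, silences that warning with np.errstate.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [-3, -2, -1, 0, 1, 2, 3], 'b': [1, 2, 3, 4, 5, 6, 7]})

conditions = [
    df['a'] <= -2,
    (df['a'] > -2) & (df['a'] < 2),
    df['a'] >= 2,
]

# All three candidate columns are computed in full here; np.select only
# chooses between them afterwards. errstate hides the sqrt-of-negative warning.
with np.errstate(invalid='ignore'):
    values = [
        df['b'] * np.sqrt(df['b'] * df['a']),
        np.log(df['b']),
        df['b'] ** 3,
    ]

df['c'] = np.select(conditions, values, default=0)
print(df)
```

With this data, the first two rows of 'c' are NaN because b * a is negative there; that comes from the formula in the question, not from np.select.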

You can use .apply across multiple columns of a pandas DataFrame (specifying axis=1) with a lambda function to get the job done. I'm not sure whether the speed will be acceptable. See this example:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a':[-3,-2,-1,0,1,2,3], 'b': [1,2,3,4,5,6,7]})

def func(a_, b_):
    if a_<=-2:
        return b_*(b_*a_)**0.5
    elif a_<2:
        return np.log(b_)
    else:
        return b_**3.

df['c'] = df[['a','b']].apply(lambda x: func(x['a'], x['b']), axis=1)
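The speed question can be checked empirically. The following is a rough sketch, not a benchmark: it builds a synthetic 5,000-row frame (the size, the random data, and the helper names with_apply and with_select are all assumptions made for illustration), confirms that the row-wise apply and the np.select approach agree, and times both with timeit. Actual numbers depend on your machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: synthetic data in the same value ranges as the example frame.
rng = np.random.default_rng(0)
n = 5_000
big = pd.DataFrame({'a': rng.integers(-3, 4, size=n),
                    'b': rng.integers(1, 8, size=n)})

def func(a_, b_):
    if a_ <= -2:
        return b_ * (b_ * a_) ** 0.5
    elif a_ < 2:
        return np.log(b_)
    else:
        return b_ ** 3.

def with_apply(frame):
    # Row-wise: func is called once per row in Python.
    return frame.apply(lambda x: func(x['a'], x['b']), axis=1)

def with_select(frame):
    # Vectorized: each branch is computed for the whole column at once.
    conditions = [
        frame['a'] <= -2,
        (frame['a'] > -2) & (frame['a'] < 2),
        frame['a'] >= 2,
    ]
    values = [
        frame['b'] * np.sqrt(frame['b'] * frame['a']),
        np.log(frame['b']),
        frame['b'] ** 3,
    ]
    return np.select(conditions, values, default=0)

# The negative square roots give NaN in both versions; hide the warnings.
with np.errstate(invalid='ignore'):
    same = np.allclose(with_apply(big).to_numpy(), with_select(big), equal_nan=True)
    t_apply = timeit.timeit(lambda: with_apply(big), number=3)
    t_select = timeit.timeit(lambda: with_select(big), number=3)

print(f"results agree: {same}")
print(f"apply:     {t_apply:.4f} s for 3 runs")
print(f"np.select: {t_select:.4f} s for 3 runs")
```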

We can use pd.cut

df.b.pow(pd.cut(df.a,[-np.inf,-2,2,np.inf],labels=[2,1,3]).astype(int))
Out[193]: 
0      1
1      4
2      3
3      4
4      5
5      6
6    343
dtype: int64
df['c']=df.b.pow(pd.cut(df.a,[-np.inf,-2,2,np.inf],labels=[2,1,3]).astype(int))

def my_func(x):
    if x['a'] <= -2:
        return x['b'] * np.sqrt(x['b'] * x['a'])

    # write other conditions as needed

df['c'] = df.apply(lambda x: my_func(x), axis=1)

The df.apply function iterates over the DataFrame and applies the function passed to it (here, the lambda). The second argument is axis; setting it to 1 means the function is called once per row, with that row's values passed in as a Series. The default is 0, in which case the function is called once per column. Finally, the function must return a value, which becomes that row's entry in column 'c'.
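To make the axis distinction concrete, here is a small sketch on a three-row frame (the frame and the variable names per_column and per_row are made up for illustration). With axis=0 the function receives each column in turn; with axis=1 it receives each row.

```python
import pandas as pd

df = pd.DataFrame({'a': [-3, -2, -1], 'b': [1, 2, 3]})

# axis=0 (the default): the function is called once per column,
# so the result is indexed by column name.
per_column = df.apply(lambda col: col.sum(), axis=0)

# axis=1: the function is called once per row, so the result is
# indexed like the DataFrame and can be assigned as a new column.
per_row = df.apply(lambda row: row['a'] + row['b'], axis=1)

print(per_column)
print(per_row)
```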

One method is to index by conditions and then operate on just those rows. Something like this:

df['c'] = np.nan
indices = [
    df['a'] <= -2,
    (df['a'] > -2) & (df['a'] < 2),
    df['a'] >= 2
]
ops = [
    lambda x: x['b'] * np.sqrt(x['b'] * x['a']),
    lambda x: np.log(x['b']),
    lambda x: x['b']**3
]
for ix, op in zip(indices, ops):
    df.loc[ix, 'c'] = op(df.loc[ix])
