简体   繁体   中英

Pandas Dataframe change column values with given interval

I am trying to convert my columns values according to an interval like,

if(x<5)
  x=2
else if(x>=5 %% x<10)
  x=3

and try to doing in python with single line code. Using mask and cut method but I could not do. this is my trial,

dataset['CURRENT_RATIO' ] = dataset['CURRENT_RATIO'].mask((dataset['CURRENT_RATIO'] < 0.02, -7.0) | (dataset['CURRENT_RATIO'] > =0.02 & dataset['CURRENT_RATIO'] < 0.37),-5))

I need this actually if x<0.02 then -7 else if x>=0.02 and x<0.37 then -5...

  inptut output
    0.015  -7
    0.02   -5
    0.37   -3
    0.75   1

TL;DR

A list comprehension will do:

dataset.CURRENT_RATIO = [-7 if i<.02 else -5 if i<.37 else -3 if i<.75 else 1 for i in dataset.CURRENT_RATIO]

Timing it with a random dataset of 1,000,000 rows gave the following result:

334 ms ± 5.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Multiple Columns

If you were to do this with multiple columns with different thresholds and update values, I would do the following:

df = pd.DataFrame({'RATIO1': np.random.rand(10000000),
                   'RATIO2': np.random.rand(10000000)})

update_per_col = {'RATIO1': [(.02, -7), (.37, -5), (.75, -3), 1],
                  'RATIO2': [(.12, 5), (.47, 6), (.85, 7), 8]}
cols_to_be_updated = ['RATIO1', 'RATIO1']

for col in cols_to_be_updated:
    df[col] = [update_per_col[col][0][1] if i<update_per_col[col][0][0] else 
               update_per_col[col][1][1] if i<update_per_col[col][1][0] else 
               update_per_col[col][2][1] if i<update_per_col[col][2][0] else update_per_col[col][3]
               for i in df[col]]

When we time the for loop with 10,000,000 rows and two columns we get:

9.37 s ± 147 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

To address Joe Ferndz's comment, lets try to speed it up. I came up with two methods: apply() and a lambda . With apply() we run the following code (only the for loop was timed):

def update_ratio(x, updates):
    if x < updates[0][0]:
        return updates[0][1]
    elif x < updates[1][0]:
        return updates[1][1]
    elif x < updates[2][0]:
        return updates[2][1]
    else:
        return updates[3]

for col in cols_to_be_updated:
    df[col] = df[col].apply(update_ratio, updates=update_per_col[col])

This gives us:

11.8 s ± 285 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Finally the lambda gives:

for col in cols_to_be_updated:
    df[col] = df[col].apply(lambda x: update_per_col[col][0][1] if x<update_per_col[col][0][0] else 
              update_per_col[col][1][1] if x<update_per_col[col][1][0] else 
              update_per_col[col][2][1] if x<update_per_col[col][2][0] else 
              update_per_col[col][3])

8.91 s ± 171 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This means that a lambda is the fastest method, but a list comprehension is not far off.

use

dataset['CURRENT_RATIO'][dataset['CURRENT_RATIO']<10]=3
dataset['CURRENT_RATIO'][dataset['CURRENT_RATIO']<5]=2

first line you make all value less than 10 to 3 then all values less than 5 to 2 it makes all value between 5 to 10 remain 3

It would be smart to write your own lambda -function and use pandas.apply()

def my_lambda(x):
    if(x<5):
        x=2
    elif(x<10):
        x=3
    return x

dataset['CURRENT_RATIO' ] = dataset['CURRENT_RATIO'].apply(my_lambda)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM