
Assign values to a pandas dataframe column based on intervals

I have a dataframe (df) with a single column of dates and a second dataframe (df_value) with three columns: a start date ('From'), an end date ('To') and an associated value. I want to add a second column to df holding the value looked up from df_value for the interval each date falls in:

import pandas as pd
df = pd.DataFrame(['30/03/2018', '01/10/2019','03/07/2020', '05/08/2020', '06/08/2020', '10/10/2020'], columns=['Date'])
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y', dayfirst=True).dt.date

df_value = pd.DataFrame([['01/01/2018','31/12/2018',1.286], ['01/01/2019','30/06/2019',1.555], ['01/07/2019','31/12/2019',1.632], ['01/01/2020','31/12/2020',1.864]], columns =['From', 'To', 'Value'])
df_value['From'] = pd.to_datetime(df_value['From'], format='%d/%m/%Y', dayfirst=True).dt.date
df_value['To'] = pd.to_datetime(df_value['To'], format='%d/%m/%Y', dayfirst=True).dt.date

At the moment I do this by applying the function below to df row by row. Although this works, I feel there must be a far more efficient way:

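def fixed_func(row):
    # Walk df_value from the top until an interval contains this row's date.
    # Note: raises IndexError if no interval contains the date.
    value = 0
    row_counter = 0

    while value == 0:
        if (row['Date'] >= df_value.iloc[row_counter, 0]) & (row['Date'] <= df_value.iloc[row_counter, 1]):
            value = df_value.iloc[row_counter, 2]
        else:
            row_counter += 1

    return value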

df['Value'] = df.apply(fixed_func, axis=1)

You can use numpy.where with broadcasting to compare every date against every interval at once:

import numpy as np

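# Reshape the dates into a column vector so they broadcast against every interval.
date_col = df.Date.values[:, np.newaxis]
# x holds the matching row positions in df, y the matching row positions in df_value.
x, y = np.where((date_col >= df_value.From.values) &
                (date_col <= df_value.To.values))

# Assumes every date falls in exactly one interval, so y lines up with df row by row.
df["Value"] = df_value.loc[y, "Value"].reset_index(drop=True)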
#        Date  Value
#0 2018-03-30  1.286
#1 2019-10-01  1.632
#2 2020-07-03  1.864
#3 2020-08-05  1.864
#4 2020-08-06  1.864
#5 2020-10-10  1.864
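
The assignment above relies on every date matching exactly one interval. As a minimal sketch of the same broadcasting idea that instead writes values back by row position, so that a date outside every interval is left as NaN, something like the following could be used. The helper name lookup_by_broadcast and the extra 2021 date are hypothetical, added only for illustration, and the sketch assumes the intervals do not overlap:

```python
import numpy as np
import pandas as pd

# Sample data in the same shape as the question, kept as datetime64 values.
# The last date (2021) is an illustrative addition that matches no interval.
df = pd.DataFrame({"Date": pd.to_datetime(
    ["30/03/2018", "01/10/2019", "03/07/2020", "15/06/2021"], format="%d/%m/%Y")})
df_value = pd.DataFrame({
    "From": pd.to_datetime(
        ["01/01/2018", "01/01/2019", "01/07/2019", "01/01/2020"], format="%d/%m/%Y"),
    "To": pd.to_datetime(
        ["31/12/2018", "30/06/2019", "31/12/2019", "31/12/2020"], format="%d/%m/%Y"),
    "Value": [1.286, 1.555, 1.632, 1.864],
})


def lookup_by_broadcast(dates, starts, ends, values):
    """Return one value per date, or NaN where no interval contains the date."""
    date_col = np.asarray(dates)[:, np.newaxis]            # shape (n_dates, 1)
    inside = (date_col >= np.asarray(starts)) & (date_col <= np.asarray(ends))
    rows, cols = np.where(inside)                          # positions of the True cells
    out = np.full(inside.shape[0], np.nan)                 # default: no match
    out[rows] = np.asarray(values, dtype=float)[cols]      # fill matched rows only
    return out


df["Value"] = lookup_by_broadcast(
    df["Date"], df_value["From"], df_value["To"], df_value["Value"])
print(df)
```

Note that the boolean matrix has one cell per (date, interval) pair, so memory grows with the product of the two lengths; that is fine for a small lookup table but worth keeping in mind for large ones.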

Here is a more detailed answer that compares your current approach with an alternative I propose:

# Setup data
import pandas as pd
import numpy as np
df = pd.DataFrame(['30/03/2018', '01/10/2019','03/07/2020', '05/08/2020', '06/08/2020', '10/10/2020'], columns=['Date'])
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y', dayfirst=True).dt.date
print(df)

df_value = pd.DataFrame([['01/01/2018','31/12/2018',1.286], ['01/01/2019','30/06/2019',1.555], ['01/07/2019','31/12/2019',1.632], ['01/01/2020','31/12/2020',1.864]], columns =['From', 'To', 'Value'])

df_value['From'] = pd.to_datetime(df_value['From'], format='%d/%m/%Y', dayfirst=True).dt.date
df_value['To'] = pd.to_datetime(df_value['To'], format='%d/%m/%Y', dayfirst=True).dt.date
print(df_value)

df2 = df.copy()
df3 = df.copy()

Your approach, wrapped in a function:

def existing():
    def fixed_func(row):
        # Walk df_value from the top until an interval contains this row's date.
        value = 0
        row_counter = 0

        while value == 0:
            if (row['Date'] >= df_value.iloc[row_counter, 0]) & (row['Date'] <= df_value.iloc[row_counter, 1]):
                value = df_value.iloc[row_counter, 2]
            else:
                row_counter += 1

        return value
    df2['Value'] = df2.apply(fixed_func, axis=1)

Here is my proposed version:

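def proposed():
    def rangecheck(dt):
        # Filter df_value with a boolean mask and take the first matching Value.
        t = df_value['Value'][(df_value['From']<=dt) & (dt<=df_value['To'])].tolist()[0]
        return t

    # Apply the lookup to each date in the column.
    df3['Value'] = df3['Date'].map(rangecheck)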

The timings for the two methods are as follows:

 [IN]> %timeit existing()
[OUT]> 1.68 ms ± 22.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
 [IN]> %timeit proposed()
[OUT]> 1.97 ms ± 23.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So my approach is not faster than your existing one on this six-row sample (about 1.97 ms versus 1.68 ms per loop), but it reads as more Pythonic :)
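
Both timed functions still run Python-level code once per date. One possible way to avoid that is pandas' IntervalIndex, sketched below. This is a swapped-in technique, not one of the approaches timed above, and it assumes the intervals do not overlap and that the columns hold real datetime64 values rather than datetime.date objects. The variable names and the extra 2021 date are illustrative only:

```python
import numpy as np
import pandas as pd

# Dates to look up; the last one (2021) is illustrative and matches no interval.
dates = pd.Series(pd.to_datetime(
    ["30/03/2018", "01/10/2019", "10/10/2020", "15/06/2021"], format="%d/%m/%Y"))
df_value = pd.DataFrame({
    "From": pd.to_datetime(
        ["01/01/2018", "01/01/2019", "01/07/2019", "01/01/2020"], format="%d/%m/%Y"),
    "To": pd.to_datetime(
        ["31/12/2018", "30/06/2019", "31/12/2019", "31/12/2020"], format="%d/%m/%Y"),
    "Value": [1.286, 1.555, 1.632, 1.864],
})

# Build closed intervals [From, To]; both endpoints are inclusive, as in the question.
intervals = pd.IntervalIndex.from_arrays(
    df_value["From"], df_value["To"], closed="both")

# get_indexer returns the position of the containing interval, or -1 if none.
pos = intervals.get_indexer(dates)

# Map positions to values, leaving NaN where no interval matched.
values = df_value["Value"].to_numpy()
looked_up = np.where(pos >= 0, values[pos], np.nan)

result = pd.DataFrame({"Date": dates, "Value": looked_up})
print(result)
```

Whether this is actually faster depends on the data size; on a handful of rows the fixed overhead may dominate, so it is worth timing on realistic inputs before switching.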


 