Populate column in data frame based on a range found in another dataframe

Question

I'm attempting to populate a column in a data frame based on whether the index value of that record falls within a range defined by two columns in another data frame.

df1 looks like:

and df2 is:

  START STOP CLASS
0   2   3   1
1   5   7   2
2   8   8   3

what I want would look like:

    a   CLASS
0   4   nan
1   45  nan
2   7   1
3   5   1
4   48  nan
5   44  2
6   22  2
7   89  2
8   45  3
9   44  nan
10  23  nan

The START column in df2 is the minimum value of the range and the STOP column is the max.

Answer 1

You can use IntervalIndex (requires v0.20.0).

First construct the index:

df2.index = pd.IntervalIndex.from_arrays(df2['START'], df2['STOP'], closed='both')

df2
Out: 
        START  STOP  CLASS
[2, 3]      2     3      1
[5, 7]      5     7      2
[8, 8]      8     8      3

Now if you index into the second DataFrame it will lookup the value in the intervals. For example,

df2.loc[6]
Out: 
START    5
STOP     7
CLASS    2
Name: [5, 7], dtype: int64

returns the second class. I don't know if it can be used with merge or with merge_asof but as an alternative you can use map:

df1['CLASS'] = df1.index.to_series().map(df2['CLASS'])

Note that I first converted the index to a Series to be able to use the Series.map method. This results in

df1
Out: 
     a  CLASS
0    4    NaN
1   45    NaN
2    7    1.0
3    5    1.0
4   48    NaN
5   44    2.0
6   22    2.0
7   89    2.0
8   45    3.0
9   44    NaN
10  23    NaN

Answer 2

Alternative solution:

classdict = df2.set_index("CLASS").to_dict("index")

rangedict = {}

for key,value in classdict.items():

    # get all items in range and assign value (the key)
    for item in list(range(value["START"],value["STOP"]+1)):
        rangedict[item] = key

extract rangedict:

{2: 1, 3: 1, 5: 2, 6: 2, 7: 2, 8: 3}

now map and possibly format(?):

df1['CLASS'] = df1.index.to_series().map(rangedict)
df1.applymap("{0:.0f}".format)

outputs:

a   CLASS
0   4   nan
1   45  nan
2   7   1
3   5   1
4   48  nan
5   44  2
6   22  2
7   89  2
8   45  3
9   44  nan
10  23  nan

Answer 3

import pandas as pd import numpy as np # Here is your existing dataframe df_existing = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD')) # Create a new empty dataframe with specific column names and data types df_new = pd.DataFrame(index=None) columns = ['field01','field02','field03','field04'] dtypes = [str,int,int,int] for c,d in zip(columns, dtypes): df_new[c] = pd.Series(dtype=d) # Set the index on the new dataframe to same as existing df_new['new_index'] = df_existing.index df_new.set_index('new_index', inplace=True) # Fill the new dataframe with specific fields from the existing dataframe df_new[['field02','field03']] = df_existing[['B','C']] print df_new

Populate column in data frame based on a range found in another dataframe

Question

3 answers

solution1
2 ACCPTED 2017-08-04 22:08:17

solution2
0 2017-08-04 22:21:06

solution3
0 2018-06-29 15:11:03

Populate column in data frame based on a range found in another dataframe

Question

3 answers

solution1 2 ACCPTED 2017-08-04 22:08:17

solution2 0 2017-08-04 22:21:06

solution3 0 2018-06-29 15:11:03

solution1
2 ACCPTED 2017-08-04 22:08:17

solution2
0 2017-08-04 22:21:06

solution3
0 2018-06-29 15:11:03